Tsinghua Science and Technology
Volume 19

Issue 2

Article 10

2014

A Fully Pipelined Probability Density Function Engine for Gaussian
Copula Model
Huabin Ruan
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Xiaomeng Huang
Center for Earth System Science, Tsinghua University, Beijing 100084, China

Haohuan Fu
Center for Earth System Science, Tsinghua University, Beijing 100084, China

Guangwen Yang
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Follow this and additional works at: https://tsinghuauniversitypress.researchcommons.org/tsinghuascience-and-technology
Part of the Computer Sciences Commons, and the Electrical and Computer Engineering Commons

Recommended Citation
Huabin Ruan, Xiaomeng Huang, Haohuan Fu et al. A Fully Pipelined Probability Density Function Engine
for Gaussian Copula Model. Tsinghua Science and Technology 2914, 19(2): 195-202.

This Research Article is brought to you for free and open access by Tsinghua University Press: Journals Publishing.
It has been accepted for inclusion in Tsinghua Science and Technology by an authorized editor of Tsinghua
University Press: Journals Publishing.

TSINGHUA SCIENCE AND TECHNOLOGY
ISSNll1007-0214ll10/12llpp195-202
Volume 19, Number 2, April 2014

A Fully Pipelined Probability Density Function Engine
for Gaussian Copula Model
Huabin Ruan , Xiaomeng Huang, Haohuan Fu, and Guangwen Yang
Abstract: The Gaussian Copula Probability Density Function (PDF) plays an important role in the fields of finance,
hydrological modeling, biomedical study, and texture retrieval. However, the existing schemes for evaluating the
Gaussian Copula PDF are all computationally-demanding and generally the most time-consuming part in the
corresponding applications. In this paper, we propose an FPGA-based design to accelerate the computation of
the Gaussian Copula PDF. Specifically, the evaluation of the Gaussian Copula PDF is mapped into a fully-pipelined
FPGA dataflow engine by using three optimization steps: transforming the calculation pattern, eliminating constant
computations from hardware logic, and extending calculations to multiple pipelines. In the experiments on 10 typical
large-scale data sets, our FPGA-based solution shows a maximum of 1870 times speedup over a well-tuned singlecore CPU-based solution, and 610 times speedup over a well-optimized parallel quad-core CPU-based solution
when processing two-dimensional data.
Key words: Gaussian Copula probability density function; FPGA; pipeline; optimization

1

Introduction

A copula is a joint Cumulative Distribution
Function (CDF) defined on a d -dimensional unit
cube Œ0; 1d such that every marginal distribution
function is uniform in the range of [0, 1]. It is
used for describing the dependence between random
variables. According to the different characteristics of
the marginal distribution function, copulas are divided
into different families such as Gaussian Copula,
Archimedean Copula, and Empirical Copula, and
among these various copulas, the Gaussian Copula is
considerably popular because of its simple algorithm
 Huabin
Ruan
and
Guangwen
Yang
are
with
Department of Computer Science and Technology,
Tsinghua University, Beijing 100084, China.
E-mail:
ruanhuabin@mail.tsinghua.edu.cn; ygw@tsinghua.edu.cn.
 Xiaomeng Huang and Haohuan Fu are with Center for Earth
System Science, Tsinghua University, Beijing 100084, China.
E-mail: hxm@tsinghua.edu.cn; haohuan@tsinghua.edu.cn.
 To whom correspondence should be addressed.
Manuscript received: 2013-02-27; revised: 2013-12-17;
accepted: 2013-12-30

and straightforward dependence structure[1-3] .
The Gaussian Copula is a distribution over the unit
cube Œ0; 1d . For a given correlation matrix ˙ 2 Rd d ,
the Gaussian Copula C Ga ; Œ0; 1d ! Œ0; 1 for random
vector X D ŒX1 ; X2 ;    ; Xd  can be written as

C Ga .u; ˙ / D ˚˙ ˚ 1 .u1 /; ˚ 1 .u2 /;    ; ˚ 1 .ud /
(1)
1
where ˚
denotes the Inverse CDF (ICDF) of
a standard normal, ˚˙ represents the CDF of a
multivariate normal distribution with mean vector zero
and covariance matrix equal to the correlation matrix
˙ . Furthermore, the Probability Density Function
(PDF) of the Gaussian Copula c Ga can generally be
written as
0
0
1T
˚ 1 .u1 /
B 1B
1
C
::
exp B
c Ga .u; ˙ / D
A 
:
1
@ 2@
2
˙j
j˙
1
˚ .ud /
0
11
˚ 1 .u1 /


B
CC
::
˙ 1 I @
(2)
AA
:
1
˚ .ud /
where I denotes a d -dimensional identity matrix.

196

The Gaussian Copula has become a widely used
multivariate modeling tool in many fields where
the multivariate dependence is of considerable
interest[4] , such as finance, hydrological modeling,
biomedical study,
and wavelet-based texture
modeling[5-8] . However, when users apply the
Gaussian Copula in their specific tasks to achieve
satisfactory performance, they also have to tolerate
the high computational complexity and the related
time cost. This is because the mathematical model
of the Gaussian Copula, including either its CDF or
the corresponding PDF, consists of plenty of timeconsuming computation operations[9] . For instance,
Lasmar and Berthoumieu[3] adopted a multivariate
Gaussian Copula PDF to develop the model of texture
images under the task of texture retrieval, and they
reported that the use of Gaussian Copula PDF-based
models under CPU architecture needed 14 times more
runtime to measure the similarity between a query
image and a candidate image than the use of the most
common modeling approach. From this perspective,
a design of Gaussian Copula is urgently demanded,
particularly in the scenarios that require a fast response
on large-sized data sets (e.g., real-time decision making
on financial assets portfolio or quick texture retrieval
from a large-scale data source).
Further, in recent years, we have seen increased
research efforts on the acceleration of Gaussian
Copula-related applications. However, most work was
devoted to the evaluation of the CDF rather than the
PDF[10-13] . In Ref. [10], Curran described a Monte
Carlo method for pricing multi-name credit derivatives
according to the standard Gaussian Copula Model for
large baskets in conventional CPU architecture and
obtained maximum of 50 times in the computation
speed as compared to the original design without
using the Gaussian Copula Model. In Refs. [11] and
[12], Kaganov et al. presented hardware designs of
the one-factor and multi-factor Gaussian Copula CDF
models on the Xilinx Virtex 5 (XC5VSX50T) FPGA
for the task of Collateralized Debt Obligations (CDOs)
pricing, and achieved 64 and 71 times speedups over
a single-core CPU version. In Ref. [13], Weston
et al. accelerated the Gaussian Copula-based valuing
tranches of CDOs on a Maxeler acceleration system
(i.e., a 10-node MaxRack configured with MaxNode1821 compute nodes), and achieved an improvement
of more than 30 times on the performance per cubic
foot and per Watt over 8-core Intel Xeon processors. In

Tsinghua Science and Technology, April 2014, 19(2): 195-202

contrast, the acceleration research on the Gaussian
Copula PDF algorithm is relatively rare.
Therefore, this paper is devoted to develop an
optimized Gaussian Copula PDF evaluation scheme,
which is able to achieve both high computation
efficiency and low resource cost. Specifically, an
FPGA-based Gaussian Copula PDF design is proposed
for this purpose. It is a fast Gaussian Copula PDF
evaluation engine running on the FPGA and capable
of handling all the time-consuming computation
operations in a fully-pipelined manner. During
the process of deploying the originally CPUfriendly Gaussian Copula PDF algorithm on the
FPGA architecture, three optimization strategies are
conducted, as follows:
 By transforming the calculation pattern of the
Gaussian Copula PDF algorithm, we significantly
reduce the consumption of the computational
resources.
 By eliminating constant computations from hardware
logic, we achieve a better computational resource
balance between the host and the accelerator.
 By extending calculations to multiple pipelines in
one pass, we effectively exploit the utilization of
the computational resources and achieve a significant
performance improvement.
Overall, by applying the above optimizations, using
one Virtex-6 SX475T FPGA, we achieve 1870 times
speedup over a single-core CPU solution and 610 times
speedup over a quad-core CPU solution. Furthermore,
the performance of our solution can be easily scaled for
future FPGA devices with more hardware sources since
all instances processed by the Gaussian Copula PDF are
independent.

2

Target Hardware

We use Maxeler’s MAX3 acceleration card to
implement our Gaussian Copula PDF engine. The
MAX3 acceleration card consists of one Virtex6 SX475T FPGA and 24 GB DDR3 onboard
memory. Figure 1 illustrates the architecture of the used
Maxeler acceleration system. The host application runs
on the conventional CPU and manages the interaction
between the host and the FPGA accelerators. On the
FPGA, kernels are the hardware designs implementing
the arithmetic and logical computations needed for an
algorithm. The manager is the collective term for the
FPGA logic that orchestrates the data flow between the
kernel and the off-chip I/O.

Huabin Ruan et al.: A Fully Pipelined Probability Density Function Engine for Gaussian Copula Model

197

want to make a CPU-friendly algorithm FPGA-friendly,
some adjustments are definitely required.
In this section, we develop an FPGA-friendly
Gaussian Copula PDF design based on the
Maxeler MAX3 acceleration card described in the
previous section to reach the maximum acceleration
performance, by using the following three optimization
strategies.
3.1

Fig. 1

Architecture of Maxeler acceleration card.

Dividing an application into kernels and a
manager lets us separate the computation from
the communication. This is beneficial because it
enables deeply pipelined kernels without the control
flow constraints, which is key to achieving high
performance. Managers use a streaming model for
the off-chip I/O to the PCI Express and DDR3 RAM
memory and are optimized to achieve highly use of
available bandwidth in the off-chip communication
channels, allowing the kernels to run at peak
performance.
Additionally,
the MAX3 acceleration card
is also attached with a compiling platform
named
MaxCompiler
provided
by
Maxeler
Technologies. Using MaxCompiler, we only need
to program the Gaussian Copula PDF algorithm in
Java, and the MaxCompiler will compile and build the
Java code into a configurable bitstream file through
the vendor’s CAD software that are integrated in
MaxCompiler. The generated bitstream file can be
loaded into the Maxeler acceleration card at runtime.

3

Hardware Design of Gaussian Copula
PDF Algorithm

For the conventional CPU architectures, which are
featured by a limited number of computational units,
operations are generally computed at different points
in time on the same functional units (i.e., “computing
in time”). In contrast, the FPGA is a reconfigurable
architecture with plenty of computational units, and
computation is completely laid out spatially on the chip
(i.e., “computing in space”). Given this difference in
the working mechanism between CPU and FPGA, if we

Transforming the calculation pattern

As Eq. (2) shows, the computation complexity of the
Gaussian Copula PDF is mainly dominated by the
evaluation of the function ˚ 1 ./, which is called d
times throughout the evaluation. Moreover, the number
of calls will continue to increase proportionally with
the increase of the amount of input data. In our design,
we implement the ICDF ˚ 1 ./ based on Acklam’s
method shown in Algorithm 1. In comparison to other
conventional algorithms for the evaluation of ˚ 1 ./,
Acklam’s algorithm is more optimized and requires
less arithmetic operations although it still needs more
than 60 arithmetic operations to achieve an accurate
result. Hence, if we can reduce the computational
resource cost of ˚ 1 ./ as much as possible, the global
resource cost of the Gaussian Copula PDF will be
reduced consequently.
As shown in Algorithm 1, this pseudo code of
˚ 1 ./ performs corresponding calculations based on
different regions (lower region, central region, and
upper region) that ui belongs to by using three IF-ELSE
structures. Moreover, the calculations inside these IFELSE structures are similar to each other: Each
structure includes two degree-5 polynomial evaluations
and one division, except that the polynomial coefficients
are different in each structure. The pseudo code works
well for the CPU architecture and will not incur
any resource cost on duplicated operations. Since
each ui belongs to one of these three regions, only
one calculation path will be executed. However, the
situation is completely different if we map the same
pseudo code onto the FPGA architecture. Specifically,
we use the similar polynomial evaluation and division
units for each of these three IF-ELSE structures.
In order to deal with the cost of the duplicated
arithmetic units, we use the multiplexer (MUX) to
select the corresponding coefficients and share the two
polynomial evaluation units and the one division unit
among the three different regions. More specifically,
looking at the numerators of these three division

198
Algorithm 1 Acklam ICDF algorithm with ui as input and xi
as output.
1: a.1/ D 3:969683028665376e C 01;
2: a.2/ D 2:209460984245205e C 02;
3: a.3/ D 2:759285104469687e C 02;
4: a.4/ D 1:383577518672690e C 02;
5: a.5/ D 3:066479806614716e C 01;
6: a.6/ D 2:506628277459239e C 00;
7: b.1/ D 5:447609879822406e C 01;
8: b.2/ D 1:615858368580409e C 02;
9: b.3/ D 1:556989798598866e C 02;
10: b.4/ D 6:680131188771972e C 01;
11: b.5/ D 1:328068155288572e C 01;
12: c.1/ D 7:784894002430293e
03;
13: c.2/ D 3:223964580411365e
01;
14: c.3/ D 2:400758277161838e C 00;
15: c.4/ D 2:549732539343734e C 00;
16: c.5/ D 4:374664141464968e C 00;
17: c.6/ D 2:938163982698783e C 00;
18: d.1/ D 7:784695709041462e
03;
19: d.2/ D 3:224671290700398e
01;
20: d.3/ D 2:445134137142996e C 00;
21: d.4/ D 3:754408661907416e C 00;
ui low;
22: ui low D 0:02425; ui high D 1
23: if 0 < ui < ui low then
24:
q D sqrt. 2  log.ui //;
25:
xi D .....c.1/  q C c.2//  q C c.3//  q C c.4//  q C
c.5//  q C c.6//=....d.1/  q C d.2//  q C d.3//  q C
d.4//  q C 1/;
26: end if
27: if ui low  ui  ui high then
28:
q D ui 0:5; r D q  q;
29:
xi D .....a.1/  r C a.2//  r C a.3//  r C a.4//  r C
a.5//  r C a.6//  q=.....b.1/  r C b.2//  r C b.3// 
r C b.4//  r C b.5//  r C 1;
30: end if
31: if ui high < ui < 1 then
32:
q D sqrt. 2  log.1 ui //;
33:
xi D .....c.1/  q C c.2//  q C c.3//  q C c.4// 
q C c.5//  Cc.6//=....d.1/  q C d.2//  q C d.3// 
q C d.4//  q C 1/;
34: end if
35: return xi ;

operations in Algorithm 1 (see lines 25, 29, and
33), if we ignore the multiplication by ui in the
low-region and high-region calculations, they are
polynomials of degree 5. Therefore, we can multiplex
the coefficients. Then, the additional multiplication by
ui in the central region’s numerator can be matched
by a multiplication by 1 in the low-region’s and the
high-region’s numerators. Next, the denominators have
different degrees, but this can be easily handled by
adding 0 as the top-degree coefficient for the low-

Tsinghua Science and Technology, April 2014, 19(2): 195-202

region’s and the high region’s denominators. Thus, after
multiplexing these coefficients, the new calculation
process of Acklam’s ICDF algorithm is shown in
Algorithm 2, which contains only one division and two
polynomials, leading to an FPGA-friendly algorithm.
In Algorithm 2, lines 6-8 correspond to the selection
of polynomial coefficients for the computation of the
division’s numerator; lines 10-12 correspond to the
selection of polynomial coefficients for the computation
of the division’s denominator. Lines 9-13 correspond
to the calculation process of the numerator and the
denominator for division. Compared with Algorithm 1,
the new Acklam’s ICDF algorithm (see Algorithm 2)
will lead to a significant saving of the computational
resources since the calculations of six polynomials and
two divisions are reduced to two and one, respectively.
3.2

Eliminating constant
hardware logic

computation

from

As Eq. (2) shows, there are three matrix-related
calculations for every d -dimensional input vector u D
1
˙ j 2 , (2) ˙ 1 , and (3) (˙
˙ 1
Œu1 ; u2 ;    ; ud : (1) j˙
I). However, the correlation matrix ˙ is rarely
changed for a specific application scenario. Therefore,
in a given scenario, these three matrix-related
calculations are calculated once and then used for the
numerous computations afterwards (called “constant
computation”). The use of FPGA hardware logic to
implement these constant computations is not ideal
Algorithm 2 Optimized Acklam ICDF algorithm with input
ui and output xi .
1: islow D ui < ui low; isupper D ui > ui high;
2: isbnd D islowkisupper; iscent ral D isbnd ;
3: pp D .isupper/‹.1:0
ui / W ui ;
4: q D .iscent ral/‹.ui
0:5/ W sqrt. 2:0log.pp//;
5: r D qq; xi D iscent ral‹r W qI mult D iscent ral‹q W 1;
6: for i nti D 1I i  6I C C i do
7:
num cŒi D iscent ral‹aŒi W cŒi;
8: end for
9: numerator D mult  .num cŒ6 C xi  .num cŒ5 C
xi  .num cŒ4 C xi  .num cŒ3 C xi  .num cŒ2 C xi 
num cŒ1/////;
10: for i nti D 5I i  1I
i do
11:
denom cŒi D iscent ral‹bŒi W d Œi 1;
12: end for
13: denomi nator D ....denom cŒ1  xi C denom cŒ2/ 
x Cdenom cŒ3/xi Cdenom cŒ4/xi Cdenom cŒ5/
xi C 1;
14: result D numerator=denomi nator;
15: xi D isupper‹
result W result ;

Huabin Ruan et al.: A Fully Pipelined Probability Density Function Engine for Gaussian Copula Model

because these matrix-related calculations are not only
extremely resource consuming but also probably slower
than a well-optimized CPU implementation. We show
the consumption of the hardware logic units of these
matrix-related calculations in Tables 1-3. From these
tables, we can see that the consumption of hardware
logic units is proportional to the dimensions of the
correlation matrix. When the dimensions are up to 18,
the consumption of hardware logic units exceed the
maximum available hardware resources in our FPGA
platform; in this case, it will be impossible to map the
Gaussian Copula PDF model to the FPGA hardware
logic.
Therefore, for these constant matrix-related
calculations, our proposed idea is to perform them
on the CPU host instead of on the FPGA accelerator. In
1
˙ j 2 and ˙ 1 on the
order to effectively compute j˙
CPU, we employ the LU decomposition method
introduced by Poole[14] instead of the extremely
time-consuming direct method. Since the determinant
of the correlation matrix ˙ is greater than zero in
the Gaussian Copula PDF shown in Eq. (2), the LU
Table 1 Hardware resources consumption for 6˙ j1=2 , ˙ 1 , ˙ 1 I in algorithm
dimensional calculation of j˙
of Gaussian Copula PDF Model with the Xilinx Vitex-6
SX475T FPGA.
Resource
Lookup Tables (LUTs)
Flip-Flops (FFs)
Block RAMs (BRAMs)
DSPs

Available
297 600
595 200
1064
2016

Used / Percent (%)
63 232 / 21.65
125 610 / 21.10
126 / 11.84
437 / 21.68

Table 2 Hardware resources consumption for 12˙ j1=2 , ˙ 1 , ˙ 1 I in algorithm
dimensional calculation of j˙
of Gaussian Copula PDF Model with the Xilinx Vitex-6
SX475T FPGA.
Resource
Lookup Tables (LUTs)
Flip-Flops (FFs)
Block RAMs (BRAMs)
DSPs

Available
297 600
595 200
1064
2016

Used / Percent (%)
19 475 / 64.00
39 897 / 65.67
246 / 23.12
1398 / 69.34

Table 3 Hardware resources consumption for 18˙ j1=2 , ˙ 1 , ˙ 1 I in algorithm
dimensional calculation of j˙
of Gaussian Copula PDF Model with the Xilinx Vitex-6
SX475T FPGA.
Resource
Lookup Tables (LUTs)
Flip-Flops (FFs)
Block RAMs (BRAMs)
DSPs

Available
297 600
595 200
1064
2016

Used / Percent (%)
312 798 / 105.11
622 899 / 104.65
367 / 34.49
2278 / 113.00

199

decomposition can be successfully applied to ˙ [15] .
In the LU decomposition, the correlation matrix is
transformed into the product of two matrices:
˙ DLU
(3)
where L denotes a lower triangular matrix with
elements only on the diagonal and below. U denotes
an upper triangular matrix with elements only on the
diagonal and above.
Then, we can use the decomposed matrix L and U to
solve the linear sets as follows:
˙  X D .L  U/  X D B
(4)
by first solving matrix Y such that
LYDB

(5)

and then solving matrix X such that
UXDY

(6)

Note that ˙ ; L; U; X; Y; and B are all d by d matrices,
and X denotes the solution matrix for the linear
equations shown in Eq. (4). The advantage of breaking
up one linear set into two successive ones is that
the solutions of a triangular set of equations are
considerably trivial and can be solved effectively[15] .
Given all of the above, if we replace B with an
identity matrix I, then ˙ 1 can be calculated by solving
the following linear equations:
˙ XDLUXDI
(7)
where the solution matrix X is the inverse matrix of the
correlation matrix ˙ and X in Eq. (7) can be solved
effectively after applying the LU decomposition on ˙ .
Further, based on the L and U matrices, the
determinant of the correlation matrix ˙ can also be
computed directly as
!
d
d
Y
Y
˙j D
li i 
ui i
(8)
j˙
i D1

i D1

where li i denotes the diagonal elements of matrix L. ui i
denotes the diagonal elements of matrix U.
3.3

Extending calculation to multiple pipelines

Up to now, based on the optimizations described
in Section 3.1 and Section 3.2, we have obtained
a relatively low computational resource cost for the
implementation of the Gaussian Copula PDF. Our next
step is to deploy the current Gaussian Copula PDF onto
the FPGA chip in the form of multiple pipelines so as
to improve the acceleration performance as much as
possible. As we know, the performance of an algorithm
is proportional to the number of parallel pipelines in
FPGA. In the best case, suppose an algorithm is not

Tsinghua Science and Technology, April 2014, 19(2): 195-202

200

confronted with the memory bottleneck, then a design
with N pipelines running in parallel in FPGA will
achieve N times speedup over a single pipeline if there
are N pipelines running in parallel in FPGA. Therefore,
the current problem then becomes the assignment of
N . Given a certain FPGA chip, the number of parallel
pipelines that we can obtain depends on the size
of the computational resource required by the target
algorithm. In general, the less the resource requirement
is, the higher the number of pipelines is. In our case,
taking into account that the computational resource
required by our Gaussian Copula PDF is variable
along with the dimension size of the input vector, we
determine the number of pipelines on the basis of the
dimensions of the input vectors.
Here, we take the 6-dimensional input vectors as
an example and show the usage of the computational
resources at the current stage in Table 4 for a 6dimensional input vector u D Œu1 ; u2 ;    ; u6 . From
Table 4, we can observe that with a 6-dimensional
vector as the input data, the current Gaussian Copula
PDF implementation only costs 20% or less of the
FPGA resources with Xilinx Vitex-6 SX475T FPGA,
implying that we can deploy multiple pipelines (four
pipelines in our case) to extend the calculation running
in parallel. A four-pipeline calculation is illustrated in
Fig. 2.

4

Experimental

4.1

Experimental settings

For the purpose of performance comparison, we have
also implemented two different well-optimized CPUbased implementations of the Gaussian Copula PDF
algorithm, including (1) a Single-Thread CPU (S-CPU)
version, (2) a Multi-Thread CPU (M-CPU) version that
uses four cores. The S-CPU and M-CPU versions are
deployed in a PC workstation powered by a 2.93 GHz
quad-core Intel i7 CPU with 4 GB DDR3 memory. The
FPGA-based implementation is deployed on a Maxeler
Table 4 Resources usage for Gaussian Copula PDF
implementation after the optimizations described in Section
3.1 and Section 3.2 for a 6-dimensional input vector with the
Xilinx Vitex-6 SX475T FPGA.
Resource
Lookup Tables (LUTs)
Flip-Flops (FFs)
Block RAMs (BRAMs)
DSPs

Available
297 600
595 200
1064
2016

Used / Percent (%)
57 649 / 19.37
75 829 / 12.74
28 / 2.63
315 / 15.63

Fig. 2 Four-parallel pipeline calculation with FPGA
architecture.

acceleration card with a Xilinx Vitex-6 SX475T FPGA
running at 175 MHz and 24 GB DDR3 onboard
memory. For more detailed information regarding the
acceleration card, please refer to Section 2.
All three of the above mentioned implementations of
the Gaussian Copula PDF algorithm are tested against
10 different data sets of different dimensions, which
are generated randomly following the example of data
sets from the real world adopted in Ref. [16]. The
dimensions of these data sets range from 2 to
20. Limited by the quantity of hardware resources
provided by Virtex-6 SX475T FPGA on one MAX3
card, the maximum dimension that our system can
handle is 20. If more hardware resources are available,
the maximum dimension can be extended. Note that
the amount of computation to be performed within
the Gaussian Copula PDF is proportional to the
dimension length. Therefore, we maintain the amount
of data instances at 1 billion in every data set for the
performance comparison.
Moreover, two measures are calculated for evaluating
the FPGA-based design’s efficiency, namely the
throughput (i.e., number of processed instances
per second) and the increase in the computational
speed. The throughput is calculated on average through
multiple runs. In our experiments, we calculate the
average result of the throughput of 20 runs of our
algorithms.
4.2

Results and discussion

We first present the throughput by the numbers of
instances processed in every second for different

Huabin Ruan et al.: A Fully Pipelined Probability Density Function Engine for Gaussian Copula Model

versions in Table 5 and then illustrate the increase in
the computational speed of the FPGA version over the
other versions in Fig. 3.
As shown in Table 5, the throughput of the FPGAbased design is significantly higher than that of the SCPU-based and the M-CPU-based solutions no matter
which data set is used. In the best case, we tested
through 20 runs, the FPGA-based solution can solve
more than 1.7 billion .1:71  109 / instances per second
in a two-dimensional scenario. As our FPGA runs at
175 MHz with 10 pipelines running in parallel, the
system is able to process 10 instances per clock cycle
without being confronted with memory bottlenecks. In
addition, as the dimension increases from 2 to 20, the
throughput declines. This is because that the larger
dimension requires more computational resources;
therefore fewer pipelines can be constructed (see last
column of Table 5). For instance, one can observe
that as the dimension ranges from 12 to 20, only one
Table 5 Performance of the Gaussian Copula PDF
implementations on S-CPU, M-CPU, and FPGA.
Data set
ID

Dimension

1
2
3
4
5
6
7
8
9
10

2
4
6
8
10
12
14
16
18
20

S-CPU
.105 /
9.14
6.64
5.66
4.38
4.24
3.98
3.89
3.73
3.56
3.41

Performance
M-CPU FPGA
.106 / .108 /
2.80
17.10
2.15
10.02
1.82
6.84
1.50
5.04
1.34
3.38
1.33
1.70
1.34
1.72
1.31
1.71
1.26
1.69
1.14
1.68

Number of
FPGA
pipelines
10
6
4
3
2
1
1
1
1
1

Note: “Performance” means the number of processed instances
per second.

Fig. 3

Speedup of the FPGA version over CPU versions.

201

pipeline can be constructed. This is because when the
dimension is greater than 12, the hardware logic unit
cost will be more than 50% in our FPGA platform,
which implies that only one pipeline can be constructed
for the algorithm of the Gaussian Copula PDF model.
We can observe from Fig. 3 that our FPGA-based
solution is significantly more efficient than the welloptimized CPU-based solutions in the case of the tested
data sets. For instance, it achieves 1870 times speedup
over an S-CPU-based solution and 610 times more
computational speed than the M-CPU-based solution
for the two-dimensional scenario. The reason behind
the high speedup of our FPGA-based design is that we
effectively utilize the FPGA computational resources by
transforming the inverse cumulative function ˚ 1 ./’s
calculation pattern suited for the FPGA architecture and
˙ j1=2 , ˙ 1 ,
moving the constant computations like j˙
1
˙
and .˙
I/ to the host. This enables us to spend
relatively low resource costs on forming the hardware
logic for the Gaussian Copula PDF algorithm and
finally to deploy more than one parallel pipeline (10
pipelines in the best case) on the FPGA, which further
improves the performance of the Gaussian Copula
PDF. Moreover, if more computational resources on
future FPGA platform being available, we can deploy
considerably more pipelines for this Gaussian Copula
PDF algorithm, which will result in even more
significant acceleration.

5

Conclusions

In this paper, we presented our FPGA-based solution for
the Gaussian Copula PDF, which is a widely used but
computationally demanding algorithm. We restructured
the work flow of the original algorithm with algorithmic
transformations in order to obtain a pipeline-friendly
algorithm. For the performance comparison, a series of
experiments were conducted on 10 large-scale data sets
of different dimensions. Experimental results indicate
that the proposed FPGA-based design has significantly
higher computation efficiency than the other two
CPU-based implementations in the cases of all the
considered data sets. For a two-dimensional scenario,
it achieved 1870 times speedup over the S-CPU-based
implementation running the original Gaussian Copula
PDF and 610 times speedup over an M-CPU-based
implementation. We believe that the FPGA platform
has considerable potential for accelerating the Gaussian
Copula PDF algorithm.

202

Tsinghua Science and Technology, April 2014, 19(2): 195-202

Acknowledgements
This work was supported in part by the National
Natural Science Foundation of China (Nos. 61303003,
41374113, and 41375102), the National High-Tech
Research and Development (863) Program of China
(Nos. 2011AA01A203 and 2013AA01A208), and the
National Key Basic Research and Development (973)
Program of China (No. 2014CB347800). We would also
like to thank Maxeler Technologies for providing the MAX
devices.

References
[1]

[2]

[3]

[4]
[5]
[6]

[7]

Y. Malevergne and D. Sornette, Testing the Gaussian copula
hypothesis for financial assets dependences, Quantitative
Finance, vol. 3, no. 4, pp. 231-250, 2003.
P. Song, M. Li, and Y. Yuan, Joint regression analysis of
correlated data using gaussian copulas, Biometrics, vol. 65,
no. 1, pp. 60-68, 2008.
N.-E. Lasmar and Y. Berthoumieu, Gaussian copula
multivariate modeling for texture image retrieval using
wavelet transforms, in Proc. the 2009 International
Conference on Acoustics, Speech and Signal Processing,
IEEE, 2009, pp. 1045-1048.
D. Li, On default correlation: A copula function approach,
The journal of Fixed Income, vol. 9, no. 4, pp. 43-45, 2000.
U. Cherubini, E. Luciano, and W. Vecchiato, Copula
Methods in Finance. Wiley, 2004.
G. Escarela and J. Carriere, Fitting competing risks with an
assumed copula, Statistical Methods in Medical Research,
vol. 12, no. 4, pp. 333-349, 2003.
C. Genest and A. Favre, Everything you always wanted to
know about copula modeling but were afraid to ask, Journal

[8]

[9]
[10]

[11]

[12]

[13]

[14]
[15]

[16]

of Hydrologic Engineering, vol. 12, no. 4, pp. 347-368,
2007.
Y. Stitou, N.-E. Lasmar, and Y. Berthoumieu, Copulas
based multivariate gamma modeling for texture
classification, in Acoustics, Speech and Signal Processing,
2009. ICASSP 2009. IEEE International Conference on,
2009, pp. 1045-1048.
R. B. Nelsen, An Introduction of Copulas. Springer-Verlag,
2006.
M. Curran, Efficient Monte Carlo for credit derivatives
under factor-copula models, http://papers.ssrn.com/sol3/
papers.cfm?abstract˙id=908882, 2013.
A. Kaganov, Hardware acceleration of Monte-Carlo
structural financial instrument pricing using a Gaussian
Copula models, Master degree dissertation, Department
of Electrical and Computer Engineering, University of
Tornoto, Canada, 2008.
A. Kaganov, A. Lakhany, and P. Chow, FPGA acceleration
of multifactor CDO pricing, ACM Transactions on
Reconfigurable Technology and Systems, vol. 4, no. 2, p. 20,
2011.
S. Weston, J.-T. Marin, J. Spooner, O. Pell, and O. Mencer,
Accelerating the computation of portfolios of tranched
credit derivatives, in 2010 IEEE Workshop on High
Performance Computational Finance, 2010, pp. 1-8.
D. Poole, Linear Algebra: A Modern Introduction.
Brooks/Cole Publishing Company, 2010.
W. Press, S. Teukolsky, W. Vetterling, and B. Flannery,
Numerical Recipes: The Art of Scientific Computing.
Cambridge University Press, 2007.
D. Wang, S. Rachev, and F. Fabozzi, Pricing tranches of a
cdo and a cds index: Recent advances and future research,
Risk Assessment, pp. 263-286, 2008.

Huabin Ruan is a postdoctoral research
fellow in the Department of Computer
Science and Technology at Tsinghua
University. His research interests include
algorithmic acceleration on reconfigurable
architecture like FPGA, GPU, MIC,
distributed service, and storage. Huabin
Ruan received his PhD degree in computer
science from Tsinghua University, China in 2013. He is a
member of IEEE.

Haohuan Fu is an associate professor
in the Center of Earth System Science
at Tsinghua University.
His research
interests include HPC in earth and
environmental
sciences,
computer
architectures, performance optimizations,
and programming tools in parallel
computing. Haohuan Fu received his PhD
degree in computing from Imperial College London in 2009. He
is a member of IEEE.

Xiaomeng Huang is an associate professor
in the Center of Earth System Science at
Tsinghua University. His main research
focuses on the algorithmic optimization
and performance acceleration of the Ocean
Models based on heterogeneous CPU,
GPU and MIC platform. Xiaomeng Huang
received his PhD degree in computer
science from Tsinghua University, China in 2007. He is a
member of IEEE.

Guangwen Yang is a professor in the
Department of Computer Science and
Technology at Tsinghua University. He
is the director of the Institute of High
Performance Computing at Tsinghua
University. His research interests include
the parallel algorithm, cloud computing
and the earth system model. Guangwen
Yang received his PhD degree in computer science from Harbin
Institute of Technology in 1996. He is a member of IEEE.

