arXiv:1704.03767v1 [cs.DC] 12 Apr 2017
Parallelized Kendall’s Tau Coefficient Computation via
SIMD Vectorized Sorting On Many-Integrated-Core
Processors
Yongchao Liua,∗, Tony Pana, Oded Greena, Srinivas Alurua,∗
aSchool of Computational Science & Engineering, Georgia Institute of Technology, Atlanta,
GA 30332, USA
Abstract
Computing pairwise association measures is an important operation in data analytics.
Kendall's tau coefficient is a widely used correlation coefficient for identifying non-linear
relationships between ordinal variables. In this paper, we investigated a parallel
algorithm accelerating all-pairs Kendall’s tau coefficient computation via single
instruction multiple data (SIMD) vectorized sorting on Intel Xeon Phis by tak-
ing advantage of many processing cores and 512-bit SIMD vector instructions.
To facilitate workload balancing and overcome on-chip memory limitation, we
proposed a generic framework for symmetric all-pairs computation by building
provable bijective functions between job identifier and coordinate space. Perfor-
mance evaluation demonstrated that our algorithm on one 5110P Phi achieves
two orders-of-magnitude speedups over 16-threaded MATLAB and three orders-
of-magnitude speedups over sequential R, both running on high-end CPUs. In
addition, our algorithm exhibited good distributed computing scalability with
respect to the number of Phis. Source code and datasets are publicly available
at http://lightpcc.sourceforge.net.
Keywords: Pairwise correlation; Kendall’s tau coefficient; all-pairs
computation; many integrated core; Xeon Phi
1. Introduction
Identifying interesting pairwise association between variables is an important
operation in data analytics. In bioinformatics and computational biology, one
typical application is to mine gene co-expression relationship via gene expression
data, which can be realized by query-based gene expression database search [1]
∗Corresponding author
Email addresses: yliu@cc.gatech.edu (Yongchao Liu), tpan7@gatech.edu (Tony Pan),
ogreen@gatech.edu (Oded Green), aluru@cc.gatech.edu (Srinivas Aluru)
1Preliminary work was presented in the 28th International Symposium on Computer Ar-
chitecture and High Performance Computing, Los Angeles, USA, 2016
Preprint submitted to Journal of Parallel and Distributed Computing June 11, 2018
or gene co-expression network analysis [2]. Gene expression database search
aims to select the subject genes in the database that are co-expressed with
the query gene. One approach is to first define some pairwise correla-
tion/dependence measure over gene expression profiles across multiple samples
(gene expression profiles for short) and then rank query-subject gene pairs by
their scores. For gene co-expression networks, nodes usually correspond to genes
and edges represent significant gene interactions inferred from the association
of gene expression profiles. To construct a gene co-expression network, all-
pairs computation over gene expression profiles is frequently conducted based
on linear (e.g. [3] [4] [5]) or non-linear (e.g. [6] [7] [8]) co-expression measures.
A variety of correlation/dependence measures have been proposed in the lit-
erature and among them, Pearson’s product-moment correlation coefficient [9]
(or Pearson’s r correlation) is the most widely used correlation measure [10].
However, this correlation coefficient is only applicable to linear correlations. In
contrast, Spearman’s rank correlation coefficient [11] (or Spearman’s ρ coeffi-
cient) and Kendall’s rank correlation coefficient [12] (or Kendall’s τ coefficient)
are two commonly used measures for non-linear correlations [13]. Spearman’s
ρ coefficient is based on Pearson’s r coefficient but applies to ranked variables,
while Kendall’s τ coefficient tests the association between ordinal variables.
These two rank-based coefficients were shown to play complementary roles in
the cases when Pearson’s r is not effective [14]. Among other non-linear mea-
sures, mutual information [15] [16] [17], Euclidean distance correlation [18] [19],
Hilbert-Schmidt information criterion [20], and maximal information criterion
[21] are frequently used as well. In addition, some unified frameworks for pair-
wise dependence assessments were proposed in the literature (e.g. [22]).
Kendall’s τ coefficient (τ coefficient for short) measures the ordinal cor-
relation between two vectors of ordinal variables. Given two ordinal vectors
u = {u1, u2, ..., un} and v = {v1, v2, ..., vn}, where variables ui and vi are both
ordinal (0 ≤ i < n), the τ coefficient computes the correlation by counting
the number of concordant pairs nc and the number of discordant pairs nd by
treating ui and vi as a joint ordinal variable (ui, vi). For the τ coefficient, a
pair of observations (ui, vi) and (uj, vj), where i ≠ j, is deemed concordant
if (ui > uj and vi > vj) or (ui < uj and vi < vj), and discordant if (ui > uj and
vi < vj) or (ui < uj and vi > vj). Note that if ui = uj or vi = vj, this pair is
considered neither concordant nor discordant.
In our study, we will consider two categories of τ coefficient, namely Tau-a
(denoted as τA) and Tau-b (denoted as τB). τA does not take into account tied
elements in each vector and is defined as
$$\tau_A = \frac{n_c - n_d}{n_0} \qquad (1)$$

where n0 = n(n − 1)/2. If all elements in each vector are distinct, we have
nc + nd = n0 and can therefore re-write Equation (1) as

$$\tau_A = \frac{n_0 - 2n_d}{n_0} = 1 - \frac{2n_d}{n_0} \qquad (2)$$
As opposed to τA, τB makes adjustments for ties and is computed as
$$\tau_B = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}} \qquad (3)$$

where n1 = Σi u′i(u′i − 1)/2 and n2 = Σi v′i(v′i − 1)/2. Here, u′i (v′i) denotes the
cardinality of the i-th group of ties for vector u (v). Within a vector, each
distinct value defines one tie group and this value acts as the identifier of the
corresponding group. The cardinality of a tie group is equal to the number of
elements in the vector whose values are identical to the identifier of the tie group.
If all elements in both vectors are distinct, τB will equal τA since n1 = n2 = 0.
This indicates that τA is a special case of τB and that an implementation of τB
will cover τA inherently. In addition, from the definitions of τA and τB, we can
see that the computation of the τ coefficient between u and v is commutative.
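As a concrete illustration of these definitions, the following is a minimal brute-force reference in C++ (our own sketch, not one of the kernels presented later): it enumerates all pairs in O(n^2) time, exploiting the fact that n1 (n2) equals the number of index pairs with tied values in u (v).

```cpp
#include <cmath>
#include <vector>

// Brute-force tau-b, directly following Equation (3); O(n^2) pair enumeration.
double kendall_tau_b_reference(const std::vector<int> &u, const std::vector<int> &v) {
  const int n = (int)u.size();
  long long nc = 0, nd = 0, ties_u = 0, ties_v = 0;
  for (int i = 0; i < n; ++i) {
    for (int j = i + 1; j < n; ++j) {
      int du = (u[i] > u[j]) - (u[i] < u[j]);   // sign of u[i] - u[j]
      int dv = (v[i] > v[j]) - (v[i] < v[j]);   // sign of v[i] - v[j]
      if (du == 0) ++ties_u;                    // contributes to n1
      if (dv == 0) ++ties_v;                    // contributes to n2
      if (du * dv > 0) ++nc;                    // concordant pair
      else if (du * dv < 0) ++nd;               // discordant pair
    }
  }
  const double n0 = 0.5 * n * (n - 1);
  return (nc - nd) / std::sqrt((n0 - ties_u) * (n0 - ties_v));
}
```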
Besides the values of correlation coefficients, some applications need to calcu-
late P -value statistics to infer statistical significance between variables. For this
purpose, one approach is the permutation test [23]. However, a permutation test
may need a substantial number of pairwise τ coefficient computations [24] even
for moderately large n, thus resulting in prohibitively long runtimes for sequential
execution. In the literature, parallelizing pairwise τ coefficient computation has
not yet been intensively explored. One recent work is from Wang et al. [25],
which accelerated the sequential τ coefficient computation in R [26] based on the
Hadoop MapReduce [27] parallel programming model. In [25], the sequential
all-pairs τ coefficient implementation in R was shown to be extremely slow on large-
scale datasets. In our study, we further confirmed this observation through our
performance assessment (refer to section 5.2).
In this paper, we parallelized all-pairs τ coefficient computation on Intel
Xeon Phis based on Many-Integrated-Core (MIC) architecture, the first work
accelerating all-pairs τ coefficient computation on MIC processors to the best
of our knowledge. This work is a continuation from our previous paralleliza-
tion of all-pairs Pearson’s r coefficient on Phi clusters [28] and further enriches
our LightPCC library (http://lightpcc.sourceforge.net) targeting parallel
pairwise association measures between variables in big data analytics. In this
work, we have investigated three variants, namely the naïve variant, the generic
sorting-enabled (GSE) variant and the vectorized sorting-enabled (VSE) variant,
built upon three pairwise τ coefficient kernels, i.e. the naïve kernel, the GSE
kernel and the VSE kernel, respectively. Given two ordinal vectors u and v of n
elements each, the naïve kernel enumerates all possible pairs of joint variables
(ui, vi) (0 ≤ i < n) to obtain nc and nd, resulting in O(n^2) time complexity. In
contrast, both the GSE and VSE kernels take sorting as the core and manage
to reduce the time complexity to O(n log n).
Given m vectors of n elements each, the overall time complexity would be
O(m^2 n^2) for the naïve variant and O(m^2 n log n) for the GSE and VSE variants.
The VSE variant enhances the GSE one by exploiting 512-bit wide single
instruction multiple data (SIMD) vector instructions in MIC processors to
implement fast SIMD vectorized pairwise merge of sorted subarrays. Furthermore,
to facilitate workload balancing and overcome on-chip memory limitation, we
investigated a generic framework for symmetric all-pairs computation by building
a provable, reversible, bijective relationship between job identifier and coordinate
space in a job matrix.
The performance of our algorithm was assessed using a collection of real
whole human genome gene expression datasets. Our experimental results demonstrate
that the VSE variant performs best on both the multi-threaded CPU
and Phi systems, compared to the other two variants. We further compared
our algorithm with the all-pairs τ coefficient implementations in the widely
used MATLAB [29] and R [26], revealing that our algorithm on a single 5110P
Phi achieves speedups of up to 812 over 16-threaded MATLAB and up to 1,166
over sequential R, both of which were benchmarked on high-end CPUs.
In addition, our algorithm exhibited rather good distributed computing scalability
with respect to the number of Phis.
2. Intel Many-Integrated-Core (MIC) Architecture
The Intel MIC architecture aims to combine many Intel processor cores onto
a single chip and has already led to the release of two generations of MIC processors.
The first generation is code-named Knights Corner (KNC) and
the second generation is code-named Knights Landing (KNL) [30]. KNC is
a PCI Express (PCIe) connected coprocessor that must be paired with Intel
Xeon CPUs. KNC is actually a shared-memory computer [31] with full cache
coherency over the entire chip and running a specialized Linux operating sys-
tem over many cores. Each core adopts an in-order micro-architecture and has
four hardware threads offering four-way simultaneous multithreading. Besides
scalar processing, each core is capable of vectorized processing from a newly
designed vector processing unit (VPU) featuring 512-bit wide SIMD instruction
set architecture (ISA). For KNC, each core has only one VPU and this VPU is
shared by all active hardware threads running on the same core. Each 512-bit
vector register can be split to either 8 lanes with 64 bits each or 16 lanes with 32
bits each. Note that the VPU does not support legacy SIMD ISAs such as the
Streaming SIMD extensions (SSE) series. As for caches, each core has separate
L1 instruction and data caches of size 32 KB each, and a 512 KB L2 cache
interconnected via a bidirectional ring bus with the L2 caches of all other cores
to form a unified shared L2 cache over the chip. The cache line size is 64 bytes.
In addition, two usage models can be used to invoke KNC: offload model and
native model, where the former relies on compiler pragmas/directives to offload
highly-parallel parts of an application to KNC, while the latter treats the KNC as
a symmetric multiprocessing computer. While primarily focusing on KNC Phis
in this work, we note that our KNC-based implementations can be easily ported
onto KNL processors, as KNL implements a superset of KNC instruction sets.
We expect that our implementations will be portable to future Phis as well.
Table 1: A list of notations used

Notation   Description
m          number of vectors
n          number of elements per vector
u          vector u of n elements, likewise for v
ui         i-th element of vector u, likewise for vi
nc         number of concordant variable pairs
nd         number of discordant variable pairs
n0         n(n − 1)/2
n1         Σi u′i(u′i − 1)/2, where u′i is the size of the i-th tie group in u
n2         Σi v′i(v′i − 1)/2, where v′i is the size of the i-th tie group in v
n3         Σi wi(wi − 1)/2, where wi is the size of the i-th joint tie group for u and v
τA         (nc − nd)/n0
τB         (nc − nd)/√((n0 − n1)(n0 − n2))
SX         a sorted subarray
|SX|       length of the sorted subarray SX
vX         a 512-bit SIMD vector with 16 32-bit integer lanes
3. Pairwise Correlation Coefficient Kernels
For the τ coefficient, we have investigated three pairwise τ coefficient kernels:
the naïve kernel, the GSE kernel and the VSE kernel. From its definition,
it can be seen that the Kendall’s τ coefficient only depends on the order of
variable pairs. Hence, given two ordinal vectors, we can first order all elements
in each vector, then replace the original value of every element with its rank
in each vector, and finally conduct the τ coefficient computation on the rank
transformed new vectors. This rank transformation does not affect the resulting
coefficient value, but could streamline the computation, especially for ordinal
variables in complex forms of representation. Moreover, this transformation
needs to be done only once beforehand for each vector. Hence, we will assume
that all ordinal vectors have already been rank transformed in the following
discussions. For convenience, Table 1 lists the notations used throughout our study.
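As a small illustration of this preprocessing, the following sketch (our own; any ranking scheme that preserves order and keeps ties equal works equally well) replaces each value by the zero-based rank of its tie group.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Dense rank transform: equal values get equal ranks, order is preserved.
std::vector<int> rank_transform(const std::vector<double> &x) {
  const int n = (int)x.size();
  std::vector<int> idx(n), rank(n);
  std::iota(idx.begin(), idx.end(), 0);
  std::sort(idx.begin(), idx.end(), [&x](int a, int b) { return x[a] < x[b]; });
  int r = 0;
  for (int i = 0; i < n; ++i) {
    if (i > 0 && x[idx[i]] != x[idx[i - 1]]) r = i;  // start of a new tie group
    rank[idx[i]] = r;
  }
  return rank;
}
```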
3.1. Naïve Kernel
The naïve kernel enumerates all possible combinations of joint variables
(ui, vi) (0 ≤ i < n) and counts the number of concordant pairs nc as well
as the number of discordant pairs nd. As mentioned above, given two joint
variable pairs (ui, vi) and (uj, vj) (i ≠ j), they are considered concordant if
(ui > uj and vi > vj) or (ui < uj and vi < vj), discordant if (ui > uj and vi < vj)
or (ui < uj and vi > vj), and neither concordant nor discordant if ui = uj or
vi = vj . Herein, we can observe that the two joint variables are concordant
if and only if the value of (ui − uj) × (vi − vj) is positive; discordant if and
only if the value of (ui − uj)× (vi − vj) is negative; and neither concordant nor
discordant if and only if the value of (ui−uj)× (vi−vj) is equal to zero. In this
case, in order to avoid branching in execution paths (particularly important for
processors without hardware branch prediction units), we compute the value of
nc−nd by examining the sign bit of the product of ui− uj and vi− vj (refer to
lines 2 and 11 in Algorithm 1).
Algorithm 1 Pseudocode of our naïve kernel
1: function calc_sign(v)
2:   return (v > 0) − (v < 0); ⊲ return 1 if v > 0, −1 if v < 0 and 0 otherwise
3: end function
4: function kendall_tau_a_naive(u, v, n)
5:   numerator = 0; ⊲ numerator represents nc − nd
6:   for i = 1; i < n; ++i do
7:     a = ui; b = vi;
8:     #pragma vector aligned
9:     #pragma simd reduction(+:numerator)
10:    for j = 0; j < i; ++j do
11:      numerator += calc_sign((a − uj) × (b − vj)); ⊲ compute nc − nd
12:    end for
13:  end for
14:  return numerator / (n(n − 1)/2);
15: end function
Algorithm 1 shows the pseudocode of the naïve kernel. From the code, the
naïve kernel has quadratic time complexity as a function of n, but its runtime
is independent of the actual content of u and v, due to the use of the function
calc_sign. Meanwhile, the space complexity is O(1). Note that this naïve
kernel is only used to compute τA.
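For reference, a C++ rendering of Algorithm 1 might look as follows; this is a sketch, and it uses the portable `#pragma omp simd` form in place of the Intel-specific `#pragma simd` shown in the pseudocode.

```cpp
// Branch-free tau-a kernel; compile with OpenMP enabled for the simd pragma.
inline int calc_sign(int v) { return (v > 0) - (v < 0); }

double kendall_tau_a_naive(const int *u, const int *v, int n) {
  long long numerator = 0;          // accumulates nc - nd
  for (int i = 1; i < n; ++i) {
    const int a = u[i], b = v[i];
    long long local = 0;
    #pragma omp simd reduction(+ : local)
    for (int j = 0; j < i; ++j)
      local += calc_sign((a - u[j]) * (b - v[j]));
    numerator += local;
  }
  return (double)numerator / (0.5 * n * (n - 1));
}
```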
3.2. Generic Sorting-enabled Kernel
Considering the close relationship between calculating τ and ordering a list of
variables, Knight [32] proposed a merge-sort-like divide-and-conquer approach
with O(n log n) time complexity, based on the assumption that no element ties
exist within any vector. As this assumption does not always hold, Knight did
mention this drawback and suggested an approximation method by averaging
counts, rather than proposing an exact solution. In this subsection, we investigate
an exact sorting-enabled solution to address both cases: with or without element
ties within any vector, together in a unified manner.
Given two ordinal vectors u and v, this GSE kernel generally works in the
following five steps.
• Step1 sorts the list of joint variables (ui, vi) (0 ≤ i < n) in ascending order,
where the joint variables are sorted first by the first element ui and sec-
ondarily by the second element vi. In this step, we used quicksort via the
standard qsort library routine, resulting in O(n log n) time complexity.
• Step2 performs a linear-time scan over the sorted list to compute n1 (refer
to Equation (3)) by counting the number of groups consisting of tied
values as well as the number of tied values in each group. Meanwhile, we
compute a new value n3 for joint ties, with respect to the pair (ui, vi), as
Σi wi(wi − 1)/2, where wi represents the number of jointly tied values in
the i-th group of joint ties for u and v.
• Step3 counts the number of discordant pairs nd by re-sorting the sorted
list obtained in Step1 in ascending order of all elements in v via a merge
sort procedure that can additionally accumulate the number of discordant
joint variable pairs each time two adjacent sorted subarrays are merged.
The rationale is as follows. Firstly, when merging two adjacent sorted
subarrays, we count the number of discordant pairs by only performing
pairwise comparison between joint variables from distinct subarrays. In
this way, we can ensure that every pair of joint variables will be enu-
merated once and only once during the whole Step3 execution. Secondly,
given two adjacent sorted subarrays to merge, it is guaranteed that the
first value (corresponding to u) of every joint variable in the left subarray
(with the smaller indices) is absolutely less than or equal to the first value
of every joint variable in the right subarray (with the larger indices), due
to the sort conducted in Step1. In particular, when the first value is iden-
tical for the two joint variables from distinct subarrays, the second value
(corresponding to v) of the joint variable from the left subarray is also
necessarily less than or equal to the second value of the joint variable from the
right subarray. This means that discordance occurs only if the second value
of a joint variable from the right subarray is less than the second value of
a joint variable from the left subarray. Therefore, the value of nd can be
gained by accumulating the number of occurrences of the aforementioned
discordance in every pairwise merge of subarrays. Algorithm 2 shows the
pseudocode of counting discordant pairs with out-of-place pairwise merge
of adjacent sorted subarrays, where the time complexity is O(n log n) and
the space complexity is O(n).
• Step4 performs a linear-time scan over the sorted list obtained in Step3
to compute n2 (see Equation (3)) in a similar way to Step2. This works
because the sorted list actually corresponds to a sorted list of all elements
in v.
• Step 5 computes the numerator nc − nd in Equations (1) and (3) as n0 −
n1 − n2 + n3 − 2nd. Note that if there is no tie in each vector, n1, n2 and
n3 will all be zero. In this case, nc−nd will be equal to n0−2nd as shown
in Equation (2).
From the above workflow, we can see that the GSE kernel takes into account
tied elements within each vector. Unlike the naïve kernel that computes τA,
the GSE kernel targets the computation of τB. As mentioned above, τA is
actually a special case of τB and an algorithm for τB will inherently cover τA.
Therefore, our GSE kernel is able to calculate both τA and τB in a unified
manner. In addition, the GSE kernel has O(n log n) time complexity, since the
time complexity is O(n log n) for both Step1 and Step3, O(n) for both Step2
and Step4, and O(1) for Step 5.
Algorithm 2 Pseudocode of Step3 of our GSE kernel
1: function gse_merge(in, out, left, mid, right)
2:   l = p = left; r = mid; nd = 0;
3:   while l < mid && r < right do ⊲ merge two sorted subarrays
4:     if in[r].v < in[l].v then
5:       nd += mid − l; ⊲ count discordant pairs
6:       out[p++] = in[r++];
7:     else
8:       out[p++] = in[l++];
9:     end if
10:  end while
11:  return nd;
12: end function
13: function kendall_tau_b_step3_gse(pairs, buffer, n)
14:  nd = 0; in = pairs; out = buffer;
15:  for s = 1; s < n; s *= 2 do
16:    for l = 0; l < n; l += 2 * s do
17:      m = min(l + s, n);
18:      r = min(l + 2 * s, n);
19:      nd += gse_merge(in, out, l, m, r); ⊲ perform out-of-place merge
20:    end for
21:    swap(in, out); ⊲ swap in and out
22:  end for
23:  if pairs != in then
24:    memcpy(pairs, in, n * sizeof(*pairs));
25:  end if
26:  return nd;
27: end function
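To show how the five steps fit together, the following is a compact C++ sketch of the GSE kernel; it assumes a routine gse_count_discordant implementing Algorithm 2 (and also copying the leftover elements of the non-exhausted subarray, which the listing above leaves implicit), which leaves the array sorted by v.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

struct Pair { int u, v; };

// Assumed: merge-sort by v while counting discordant pairs (Algorithm 2).
long long gse_count_discordant(std::vector<Pair> &a);

// Sum of t*(t-1)/2 over runs of equal keys in an array already sorted by that key.
template <class Key>
static long long tie_count(const std::vector<Pair> &a, Key key) {
  long long ties = 0, run = 1;
  for (size_t i = 1; i <= a.size(); ++i) {
    if (i < a.size() && key(a[i]) == key(a[i - 1])) { ++run; }
    else { ties += run * (run - 1) / 2; run = 1; }
  }
  return ties;
}

double kendall_tau_b_gse(std::vector<Pair> a) {
  const long long n = (long long)a.size(), n0 = n * (n - 1) / 2;
  // Step 1: sort joint variables by u, then by v.
  std::sort(a.begin(), a.end(), [](const Pair &x, const Pair &y) {
    return x.u != y.u ? x.u < y.u : x.v < y.v; });
  // Step 2: ties in u (n1) and joint ties in (u, v) (n3).
  long long n1 = tie_count(a, [](const Pair &p) { return p.u; });
  long long n3 = tie_count(a, [](const Pair &p) { return std::make_pair(p.u, p.v); });
  // Step 3: count discordant pairs; afterwards a is sorted by v.
  long long nd = gse_count_discordant(a);
  // Step 4: ties in v (n2).
  long long n2 = tie_count(a, [](const Pair &p) { return p.v; });
  // Step 5: nc - nd = n0 - n1 - n2 + n3 - 2*nd.
  long long numer = n0 - n1 - n2 + n3 - 2 * nd;
  return numer / std::sqrt((double)(n0 - n1) * (double)(n0 - n2));
}
```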
3.3. Vectorized Sorting-enabled Kernel
The VSE kernel enhances the GSE kernel by employing 512-bit SIMD vector
instructions on MIC processors to implement vectorized pairwise merge of sorted
subarrays. In contrast with the GSE kernel, the VSE kernel has made the
following algorithmic changes. The first is packing a rank variable pair (ui, vi)
into a signed 32-bit integer. In this packed format, ui occupies the most
significant 16 bits and vi the least significant 16 bits, with each value restricted
to 15 bits. In this case, the VSE kernel limits the maximum
allowable vector size n to 2^15 − 1 = 32,767. The second is that, due to packing,
we replace the generic variable pair quicksort in Step1 with an integer sorting
method, and re-implement the discordant pair counting algorithm in Step3 (see
Algorithm 2) based on integer representation. In Step1, sorting packed integers
is equivalent to sorting generic variable pairs (ui, vi), since within any packed
integer ui sits in higher bits and vi in lower bits. In Step3, an additional
preprocessing procedure is needed to reset to zero the most significant 16 bits
of each packed integer (corresponding to u). In our implementation, we split a
512-bit SIMD vector into a 16-lane 32-bit-integer vector and then investigated
a vectorized merge sort for Step1 and a vectorized discordant pair counting
algorithm for Step3, both of which use the same pairwise merge method and
also follow a very similar procedure. Algorithm 3 shows the pseudocode of Step3
of our VSE kernel.
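A minimal sketch of the packed representation (with our own helper names) is shown below; because ui sits in the upper 16 bits, ordinary integer comparison of packed values reproduces the (u, then v) ordering used in Step1.

```cpp
#include <cstdint>

// Pack a rank pair into one signed 32-bit integer; requires u, v <= 2^15 - 1,
// so the result stays non-negative and compares like (u, then v).
inline int32_t pack(uint16_t u, uint16_t v) {
  return ((int32_t)u << 16) | (int32_t)v;
}

// Preprocessing before Step3: clear the u part so only v is compared.
inline int32_t keep_v_only(int32_t packed) {
  return packed & 0xFFFF;
}
```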
In the literature, some research has been done to accelerate sorting algo-
rithms by means of SIMD vectorized pairwise merge of sorted subarrays. Hi-
roshi et al. [33] employed an SSE vectorized odd-even merge network [34] to
merge sorted subarrays in an out-of-core way. Chhugani et al. [35] adopted
Algorithm 3 Pseudocode of Step3 of our VSE kernel
1: function vse_merge(in, out, left, mid, right)
2:   l = left; r = mid; p = left; nd = 0;
3:   if mid − left < 16 || right − mid < 16 then
4:     while l < mid && r < right do
5:       if in[r] < in[l] then ⊲ corresponds to variable v
6:         nd += mid − l; out[p++] = in[r++];
7:       else
8:         out[p++] = in[l++];
9:       end if
10:    end while
11:  else
       ⊲ count discordant pairs
12:    while l < mid && r < right do
13:      if in[r] < in[l] then ⊲ corresponds to variable v
14:        nd += mid − l; r++;
15:      else
16:        l++;
17:      end if
18:    end while
       ⊲ merge two sorted subarrays
19:    l = left; r = mid;
20:    vMin = _mm512_load_epi32(in + l); vMax = _mm512_load_epi32(in + r);
21:    l += 16; r += 16;
22:    while true do
23:      if _mm512_reduce_min_epi32(vMin) ≥ _mm512_reduce_max_epi32(vMax) then
24:        _mm512_store_epi32(out + p, vMax); p += 16; vMax = vMin;
25:      else if _mm512_reduce_min_epi32(vMax) ≥ _mm512_reduce_max_epi32(vMin) then
26:        _mm512_store_epi32(out + p, vMin); p += 16;
27:      else
           ⊲ invoke Algorithm 4
28:        bitonic_merge_16way(vMin, vMax);
29:        _mm512_store_epi32(out + p, vMin); p += 16;
30:      end if
31:      if l + 16 ≥ mid || r + 16 ≥ right then
32:        break;
33:      end if
34:      A = in[l]; B = in[r]; C = _mm512_reduce_max_epi32(vMax);
35:      if C ≤ A && C ≤ B then
36:        _mm512_store_epi32(out + p, vMax); p += 16;
37:        vMin = _mm512_load_epi32(in + l); l += 16;
38:        vMax = _mm512_load_epi32(in + r); r += 16;
39:      else if B < A then
40:        vMin = _mm512_load_epi32(in + r); r += 16;
41:      else
42:        vMin = _mm512_load_epi32(in + l); l += 16;
43:      end if
44:    end while
45:  end if
     ⊲ invoke Algorithm 5
46:  bitonic_merge_16way_leftover(in, out, l, r, p, mid, right, vMax);
47:  return nd;
48: end function
49: function kendall_tau_b_step3_vse(pairs, buffer, n)
50:  nd = 0; in = pairs; out = buffer;
51:  for s = 1; s < n; s *= 2 do
52:    for l = 0; l < n; l += 2 * s do
53:      m = min(l + s, n);
54:      r = min(l + 2 * s, n);
55:      nd += vse_merge(in, out, l, m, r); ⊲ perform out-of-place merge
56:    end for
57:    swap(in, out); ⊲ swap in and out
58:  end for
59:  if pairs != in then
60:    memcpy(pairs, in, n * sizeof(*pairs));
61:  end if
62:  return nd;
63: end function
Figure 1: 16-way bitonic merge network: (1) input data are stored in two 16-lane vectors and
(2) pipelining from left-to-right
similar ideas to [33], but combined an SSE vectorized in-register odd-even merge
sort [34] with an in-memory bitonic merge network. Chen et al. [36] absorbed
the merge-path idea of [37, 38] and extended the SSE vectorized work of [35]
to take advantage of 512-bit SIMD VPUs on KNC Phis and used a vectorized
in-register bitonic merge sort, instead of the in-register odd-even merge sort.
In our VSE kernel, we engineered a 512-bit SIMD vectorized in-register bitonic
merge network as the core of our pairwise merge procedure for sorted subarrays,
which is similar to [36], and further proposed a predict-and-skip mechanism to
reduce the number of comparisons during pairwise merge.
3.3.1. In-register bitonic merge network
The in-register bitonic merge network is the core of our vectorized pairwise
merge of sorted subarrays. In our algorithm, this network has 16 ways and
merges two sorted vectors vMin and vMax (Figure 1 shows the computation
layout of the network), where all elements in each vector are placed in ascending
order from lane 0 to lane 15. In this case, to generate an input bitonic sequence
from the two vectors, we need to reverse the order of all elements in one and only
one vector (reverse vMax in our case) and this order reversal is realized by one
permutation instruction, i.e. _mm512_permutevar_epi32(·). Having completed
the order reversal, we can complete the sorting of vectors vMin and vMax in
log2(32) = 5 steps. Algorithm 4 shows the pseudocode for our 16-way in-register
bitonic merge network. From the code, the function bitonic_merge_16way is
composed of a fixed number of vector instructions and thus has a constant time
complexity.
Algorithm 4 Pseudocode of our 16-way in-register bitonic merge network
1: procedure bitonic_merge_16way(vMin, vMax)
   ⊲ constant vector variables, defined once and reused
2:   //vReverse = _mm512_set_epi32(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
3:   //vPermIndex16 = _mm512_set_epi32(7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8);
4:   //vPermIndex8 = _mm512_set_epi32(11, 10, 9, 8, 15, 14, 13, 12, 3, 2, 1, 0, 7, 6, 5, 4);
5:   //vPermIndex4 = _mm512_set_epi32(13, 12, 15, 14, 9, 8, 11, 10, 5, 4, 7, 6, 1, 0, 3, 2);
6:   //vPermIndex2 = _mm512_set_epi32(14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1);
     ⊲ Reverse vector vMax
7:   vMax = _mm512_permutevar_epi32(vReverse, vMax);
     ⊲ Level 1
8:   vL1 = _mm512_min_epi32(vMin, vMax);
9:   vH1 = _mm512_max_epi32(vMin, vMax);
     ⊲ Level 2
10:  vTmp = _mm512_permutevar_epi32(vPermIndex16, vL1);
11:  vTmp2 = _mm512_permutevar_epi32(vPermIndex16, vH1);
12:  vL2 = _mm512_mask_min_epi32(vL2, 0x00ff, vTmp, vL1);
13:  vH2 = _mm512_mask_min_epi32(vH2, 0x00ff, vTmp2, vH1);
14:  vL2 = _mm512_mask_max_epi32(vL2, 0xff00, vTmp, vL1);
15:  vH2 = _mm512_mask_max_epi32(vH2, 0xff00, vTmp2, vH1);
     ⊲ Level 3
16:  vTmp = _mm512_permutevar_epi32(vPermIndex8, vL2);
17:  vTmp2 = _mm512_permutevar_epi32(vPermIndex8, vH2);
18:  vL3 = _mm512_mask_min_epi32(vL3, 0x0f0f, vTmp, vL2);
19:  vH3 = _mm512_mask_min_epi32(vH3, 0x0f0f, vTmp2, vH2);
20:  vL3 = _mm512_mask_max_epi32(vL3, 0xf0f0, vTmp, vL2);
21:  vH3 = _mm512_mask_max_epi32(vH3, 0xf0f0, vTmp2, vH2);
     ⊲ Level 4
22:  vTmp = _mm512_permutevar_epi32(vPermIndex4, vL3);
23:  vTmp2 = _mm512_permutevar_epi32(vPermIndex4, vH3);
24:  vL4 = _mm512_mask_min_epi32(vL4, 0x3333, vTmp, vL3);
25:  vH4 = _mm512_mask_min_epi32(vH4, 0x3333, vTmp2, vH3);
26:  vL4 = _mm512_mask_max_epi32(vL4, 0xcccc, vTmp, vL3);
27:  vH4 = _mm512_mask_max_epi32(vH4, 0xcccc, vTmp2, vH3);
     ⊲ Level 5: vMin and vMax store the sorted sequence
28:  vTmp = _mm512_permutevar_epi32(vPermIndex2, vL4);
29:  vTmp2 = _mm512_permutevar_epi32(vPermIndex2, vH4);
30:  vMin = _mm512_mask_min_epi32(vMin, 0x5555, vTmp, vL4);
31:  vMax = _mm512_mask_min_epi32(vMax, 0x5555, vTmp2, vH4);
32:  vMin = _mm512_mask_max_epi32(vMin, 0xaaaa, vTmp, vL4);
33:  vMax = _mm512_mask_max_epi32(vMax, 0xaaaa, vTmp2, vH4);
34: end procedure
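For readers less familiar with vector intrinsics, the following scalar sketch (ours) computes the same result as the 16-way network: reverse the second run to form a 32-element bitonic sequence, then apply log2(32) = 5 compare-exchange levels with strides 16, 8, 4, 2 and 1.

```cpp
#include <algorithm>

// Scalar emulation of the 16-way bitonic merge: lo and hi are ascending runs of
// 16 elements each; afterwards lo holds the 16 smallest and hi the 16 largest values.
void bitonic_merge_16way_scalar(int lo[16], int hi[16]) {
  int c[32];
  for (int i = 0; i < 16; ++i) { c[i] = lo[i]; c[16 + i] = hi[15 - i]; }  // reverse hi
  for (int s = 16; s >= 1; s >>= 1)                 // 5 compare-exchange levels
    for (int start = 0; start < 32; start += 2 * s)
      for (int j = start; j < start + s; ++j)
        if (c[j] > c[j + s]) std::swap(c[j], c[j + s]);
  for (int i = 0; i < 16; ++i) { lo[i] = c[i]; hi[i] = c[16 + i]; }
}
```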
3.3.2. Vectorized pairwise merge of sorted subarrays
Our vectorized pairwise merge of sorted subarrays relies on the aforemen-
tioned 16-way in-register bitonic merge network and adopted a very similar
procedure to [33]. Given two sorted subarrays SA and SB, we assume that they
are aligned to 64 bytes and their lengths |SA| and |SB| are multiples of 16, for
the convenience of discussion. In this way, the vectorized pairwise merge works
as follows.
1. loads the smallest 16 elements of SA and SB to vMin and vMax, respec-
tively, and advances the pointer of each subarray.
2. invokes bitonic merge 16way(·) to sort vectors vMin and vMax, and
then stores the content of vMin, the smallest 16 elements, to the resulting
output array.
Algorithm 5 Pseudocode of processing the leftovers in both subarrays
1: procedure bitonic_merge_16way_leftover(in, out, l, r, p, mid, right, vMax)
2:   vIndexInc = _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
3:   vMaxInt = _mm512_set1_epi32(0x7fffffff); ⊲ maximum signed integer value
4:   vMid = _mm512_set1_epi32(mid);
5:   vRight = _mm512_set1_epi32(right);
     ⊲ initialize the vector mask for vMax
6:   maskMax = 0xffff;
7:   while l < mid || r < right do
8:     if l < mid && r < right then
9:       if in[r] < in[l] then
10:        vTmp = _mm512_add_epi32(_mm512_set1_epi32(r), vIndexInc);
11:        maskMin = _mm512_cmplt_epi32_mask(vTmp, vRight);
12:        vMin = _mm512_mask_load_epi32(vMaxInt, maskMin, in + r); r += 16;
13:      else
14:        vTmp = _mm512_add_epi32(_mm512_set1_epi32(l), vIndexInc);
15:        maskMin = _mm512_cmplt_epi32_mask(vTmp, vMid);
16:        vMin = _mm512_mask_load_epi32(vMaxInt, maskMin, in + l); l += 16;
17:      end if
18:    else if l < mid then
19:      vTmp = _mm512_add_epi32(_mm512_set1_epi32(l), vIndexInc);
20:      maskMin = _mm512_cmplt_epi32_mask(vTmp, vMid);
21:      vMin = _mm512_mask_load_epi32(vMaxInt, maskMin, in + l);
22:      tmp = _mm512_mask_reduce_max_epi32(maskMax, vMax);
23:      if tmp ≤ _mm512_reduce_min_epi32(vMin) then
24:        break;
25:      end if
26:      l += 16;
27:    else if r < right then
28:      vTmp = _mm512_add_epi32(_mm512_set1_epi32(r), vIndexInc);
29:      maskMin = _mm512_cmplt_epi32_mask(vTmp, vRight);
30:      vMin = _mm512_mask_load_epi32(vMaxInt, maskMin, in + r);
31:      tmp = _mm512_mask_reduce_max_epi32(maskMax, vMax);
32:      if tmp ≤ _mm512_reduce_min_epi32(vMin) then
33:        break;
34:      end if
35:      r += 16;
36:    end if
       ⊲ invoke Algorithm 4
37:    bitonic_merge_16way(vMin, vMax);
       ⊲ calculate new vector masks
38:    nb = _mm_countbits_32(maskMin) + _mm_countbits_32(maskMax);
39:    maskMin = nb > 15 ? 0xffff : (1 << nb) − 1;
40:    maskMax = nb < 17 ? 0 : (1 << (nb − 16)) − 1;
41:    _mm512_mask_store_epi32(out + p, maskMin, vMin);
42:    p += _mm_countbits_32(maskMin);
43: end while
    ⊲ write out vMax
44: _mm512_mask_packstorelo_epi32(out + p, maskMax, vMax);
45: _mm512_mask_packstorehi_epi32(out + p + 16, maskMax, vMax);
46: p += _mm_countbits_32(maskMax);
    ⊲ copy out the rest
47: if l < mid then
48:   memcpy(out + p, in + l, (mid − l) * sizeof(int));
49: else if r < right then
50:   memcpy(out + p, in + r, (right − r) * sizeof(int));
51: end if
52: end procedure
3. compares the next element of SA and SB and loads 16 elements into vMin
from the subarray whose next element is the smaller, followed by pointer
advancing for the subarray.
4. jumps back to the second step and repeats the procedure until all elements
in both subarrays have been processed.
5. completes the merge of SA and SB by writing the content of vMax to the
output array.
Note that if |SA| or |SB| is not a multiple of 16, we will have to implement
some special treatment to deal with the leftovers in SA and SB . This special
treatment is implemented as Algorithm 5 and invoked by Algorithm 3 (see line
46). From the above procedure, we can see that the time complexity of merging
SA and SB is linear and thus the VSE kernel has a time complexity of O(n log n).
Since both SA and SB are assumed to be aligned to 64 bytes, we can
use only one _mm512_load_epi32(·) instruction to load 16 integer elements from
memory to a vector and only one _mm512_store_epi32(·) instruction to store 16
elements in a vector to memory. This is because the memory address is guaranteed
to remain aligned to 64 bytes during pairwise merge. Defining T_rd to denote
the latency of a memory load (in clock cycles) and T_wt to denote the latency of
a memory store, the number of clock cycles per element load from memory can be
estimated as (T_rd + 16)/16 = T_rd/16 + 1 clock cycles and the number of clock
cycles per element store to memory as (T_wt + 16)/16 = T_wt/16 + 1 clock cycles.
Moreover, bitonic_merge_16way(·) is composed of 27 vector instructions (see
lines 7∼33 in Algorithm 4) and each of its invocations leads to the output of 16
elements. In this regard, the average number of instructions per element can
be estimated as 27/16 ≈ 1.69. Considering that we only used simple integer
and single-source permutation instructions, each of which has a two-clock-cycle
latency according to the Intel Xeon Phi processor software optimization manual
[39], the average computational cost per element can be estimated as ⌈2 ×
27/16⌉ = 4 clock cycles in the worst case where consecutive instructions always
have data dependencies. In order to reduce inter-instruction latency caused by
data dependency, we have manually tuned the order of instructions by means
of interleaving. Since each instruction has a latency of two clock cycles, this
interleaving could allow one instruction to execute per clock cycle, thus
optimistically reducing the cost per element further to ⌈1 × 27/16⌉ = 2 clock
cycles. In sum, the overall cost per element can be favorably estimated
to be (T_rd + T_wt)/16 + 4 clock cycles.
We have proposed a predict-and-skip scheme to determine (i) whether we
need to invoke the 16-way bitonic merge network and (ii) whether we should
load two vectors of new elements from SA and SB, respectively. For case (i),
the rationale is that if the minimum value in vMin (vMax) is ≥ the maximum value
in vMax (vMin), every value in vMin (vMax) will be ≥ all values in vMax
(vMin). In this case, there is no need to invoke the bitonic merge network
(lines 23∼26 in Algorithm 3), since the order of all elements in both vectors
is already deterministic. For case (ii), we compare the maximum value C in
vMax with both the smallest element A in the rest of SA and the one B in
the rest of SB (lines 34∼38 in Algorithm 3). If C ≤ A and C ≤ B, none
of the elements in vMax will be greater than any element in the rest of SA
and SB, indicating that the absolute order of all vMax elements has already
been determined. In this case, we write vMax to the resulting output array
and load two vectors of new elements from SA and SB to vMin and vMax,
respectively. The updates in vMin and vMax will be used for the next iteration
of comparison.
It is worth mentioning that we also developed a 32-way bitonic merge net-
work to implement the VSE kernel. Unfortunately, this new kernel based on
32-way merge network yielded slightly inferior performance to the kernel with
the 16-way network through our tests. In this regard, we adopted the 16-way
merge network in our VSE kernel.
4. Parallel All-Pairs Computation
4.1. All-Pairs Computation Framework
We consider the m ×m job matrix to be a 2-dimensional coordinate space
on the Cartesian plane, and define the left-top corner to be the origin, the
horizontal x-axis (corresponding to columns) in left-to-right direction and the
vertical y-axis (corresponding to rows) in top-to-bottom direction. For non-
symmetric all-pairs computation (non-commutative pairwise computation), the
workload distribution over processing elements (PEs), e.g. threads, processes,
cores, etc., would be relatively straightforward. This is because each coordinate
in the 2-dimensional matrix corresponds to a distinct job. Unlike non-symmetric
all-pairs computation, for symmetric all-pairs computation (commutative
pairwise computation) it suffices to compute only the upper triangle (or lower
triangle) of the job matrix. In this case, balanced workload distribution could be
more complex than for non-symmetric all-pairs computation.
In this paper, we propose a generic framework for workload balancing in
symmetric all-pairs computation, since the computation of the τ coefficient be-
tween any u and v is commutable as mentioned above. This framework works
by assigning each job a unique identifier and then building a bijective relation-
ship between a job identifier Jm(y, x) and its corresponding coordinate (y, x).
We refer to this as direct bijective mapping. While facilitating balanced
workload distribution, this mapping merely relies on bijective functions, which
is a prominent feature distinguished from existing methods. To the best of our
knowledge, in the literature bijective functions have not ever been proposed for
workload balancing in symmetric all-pairs computation. In [40], the authors
used a very similar job numbering approach to ours (shown in this study), but
did not derive a bijective function for symmetric all-pairs computation. Our
framework can be applied to cases with identical (e.g. using static workload
distribution) or varied workload per job (e.g. using a shared integer counter
to realize dynamic workload distribution via remote memory access operations
in MPI [41] and Unified Parallel C (UPC) programming models [42] [43]) and
is also particularly useful for parallel computing architectures with hardware
 0  1  2  3  4  5  6  7
    8  9 10 11 12 13 14
      15 16 17 18 19 20
         21 22 23 24 25
            26 27 28 29
               30 31 32
                  33 34
                     35

Figure 2: (a) an example direct bijective mapping between job identifier and coordinate space (shown above), (b) solution region for Equation (7), and (c) solution region for Equation (8)
schedulers such as GPUs and FPGAs. In the following, without loss of gener-
ality, we will interpret our framework relative to the upper triangle of the job
matrix, including the major diagonal. Nonetheless, this framework can be
easily adapted to the cases excluding the major diagonal.
Direct bijective mapping. Given a job (y, x) in the upper triangle, we compute
its integer job identifier Jm(y, x) as
$$J_m(y, x) = F_m(y) + x - y, \quad 0 \le y \le x < m \qquad (4)$$

for dimension size m. In this equation, Fm(y) is the total number of cells
preceding row y in the upper triangle and is computed as

$$F_m(y) = \frac{y(2m - y + 1)}{2} \qquad (5)$$
where y varies in [0,m]. In this way, we have defined Equation (4) based on our
job numbering policy, i.e. all job identifiers vary in [0,m(m+ 1)/2) and jobs
are sequentially numbered left-to-right and top-to-bottom in the upper triangle
(see Fig. 2a for an example).
Reversely, given a job identifier J = Jm(y, x) (0 ≤ J < m(m+1)/2), we need
to compute the coordinate (y, x) in order to locate the corresponding variable
pair. As per our definition, we have
$$\begin{cases} J \ge F_m(y) \Leftrightarrow y^2 - (2m+1)y + 2J \ge 0 \\ J \le F_m(y+1) - 1 \Leftrightarrow y^2 - (2m-1)y + 2(J+1) - 2m \le 0 \end{cases} \qquad (6)$$

By solving J ≥ Fm(y) (see Figure 2b), we get

$$y \le m + 0.5 - \sqrt{m^2 + m + 0.25 - 2J} \qquad (7)$$

while getting

$$y \ge m - 0.5 - \sqrt{m^2 + m + 0.25 - 2(J+1)} \qquad (8)$$

by solving J ≤ Fm(y + 1) − 1 (see Figure 2c). In this case, by defining
∆ = √(m² + m + 0.25 − 2(J + 1)), ∆′ = √(m² + m + 0.25 − 2J) and
z = m − 0.5 − √(m² + m + 0.25 − 2(J + 1)), we can reformulate Equations (7)
and (8) as z ≤ y ≤ z + 1 + ∆ − ∆′. Because ∆ < ∆′, we know that
[z, z + 1 + ∆ − ∆′] is a sub-range of [z, z + 1) and thereby have z ≤ y < z + 1.
Based on our job numbering policy stated before, Equation (6), viewed as a
condition on the integer y, definitely has a solution, meaning that at least one
integer exists in [z, z + 1 + ∆ − ∆′] which satisfies Equation (6). Meanwhile,
there always exists one and only one integer in [z, z + 1) (this can be easily
proved) and this integer equals ⌈z⌉, regardless of the value of z. Since
[z, z + 1 + ∆ − ∆′] is a sub-range of [z, z + 1), we can conclude that Equation (6)
has a unique solution y that is computed as

$$y = \lceil z \rceil = \left\lceil m - 0.5 - \sqrt{m^2 + m + 0.25 - 2(J+1)} \right\rceil \qquad (9)$$

Having obtained y, we can compute the coordinate x as

$$x = J + y - F_m(y) \qquad (10)$$

based on Equation (4).
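A direct C++ sketch of Equations (4), (5), (9) and (10) is given below; the function names are ours, and a small integer correction guards against floating-point rounding in the square root of Equation (9).

```cpp
#include <cmath>
#include <cstdint>

// F_m(y): number of upper-triangle cells preceding row y (Equation (5)).
static inline int64_t F(int64_t m, int64_t y) { return y * (2 * m - y + 1) / 2; }

// Equation (4): job identifier from coordinate (y, x), 0 <= y <= x < m.
static inline int64_t job_id(int64_t m, int64_t y, int64_t x) { return F(m, y) + x - y; }

// Equations (9) and (10): coordinate (y, x) from job identifier J.
static inline void job_coord(int64_t m, int64_t J, int64_t &y, int64_t &x) {
  double z = m - 0.5 - std::sqrt((double)m * m + m + 0.25 - 2.0 * (J + 1));
  y = (int64_t)std::ceil(z);
  if (F(m, y) > J) --y;        // guard against rounding up
  if (F(m, y + 1) <= J) ++y;   // guard against rounding down
  x = J + y - F(m, y);
}
```

For example, with m = 8 as in Figure 2a, job identifier J = 9 maps back to coordinate (y, x) = (1, 2).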
4.2. Tiled Computing
Based on the direct bijective mapping, we adopted tiled computing with
the intention to benefit from L1/L2 caches. The rationale behind the tiled
computing is loading into cache a small subset of the bigger dataset and reusing
this block of data in cache for multiple passes. This technique partitions a
matrix into a non-overlapping set of equal-sized q × q tiles. In our case, we
partition the job matrix and produce a tile matrix of size w × w tiles, where
w = ⌈m/q⌉. In this way, all jobs in the upper triangle of the job matrix are still
fully covered by the upper triangle of the tile matrix. By treating a tile as a
unit, we can assign a unique identifier to each tile in the upper triangle of the
tile matrix and then build bijective functions between tile identifiers and tile
coordinates in the tile matrix, similarly as we do for the job matrix.
Because the tile matrix has an identical structure to the original job matrix,
we can directly apply our bijective mapping to the tile matrix. In this case,
given a coordinate (yq, xq) (0 ≤ yq ≤ xq < w) in the upper triangle of the tile
matrix, we can compute a unique tile identifier Jw(yq, xq) as
$$J_w(y_q, x_q) = F_w(y_q) + x_q - y_q, \quad 0 \le y_q \le x_q < w \qquad (11)$$

where Fw(yq) is defined similarly to Equation (5) as

$$F_w(y_q) = \frac{y_q(2w - y_q + 1)}{2} \qquad (12)$$

Likewise, given a tile identifier Jw, such that 0 ≤ Jw < w(w + 1)/2, we can
reversely compute its unique vertical coordinate yq as

$$y_q = \left\lceil w - 0.5 - \sqrt{w^2 + w + 0.25 - 2(J_w + 1)} \right\rceil \qquad (13)$$

and subsequently its unique horizontal coordinate xq as

$$x_q = J_w + y_q - F_w(y_q) \qquad (14)$$
As the optimal tile size is subject to cache sizes and input data, tuning the tile
size is believed to be important for gaining high performance. This tuning process,
however, is tedious and has to be conducted case-by-case. For convenience, we
empirically set q to 8 for CPUs and to 4 for Phis. Nevertheless, users are free
to tune tile sizes to meet their needs.
4.3. Multithreading
4.3.1. Asynchronous kernel execution
When m is large, we may not have sufficient memory to hold the resulting
m × m correlation matrix Mτ entirely in memory. To overcome this memory limita-
tion, we adopted a multi-pass kernel execution model which partitions the tile
identifier range [0, w(w+1)/2) into a set of non-overlapping sub-ranges (equal-
sized in our case) and finishes the computation one sub-range after another. In
this way, we do not need to allocate the whole memory for matrix Mτ . Instead,
we only need to allocate a small amount of memory to store the computing
results of one sub-range, thus considerably reducing memory footprint.
When using the KNC Phi, we need to transfer the newly computed results
from the Phi to the host after having completed each pass of kernel execution.
If kernel execution and data transfer are conducted sequentially, the Phi will
be kept idle during device-to-host data transfer. In this regard, a preferable
solution is to employ asynchronous kernel execution by enabling concurrent
execution of host-side tasks and accelerator-side kernel execution. Fortunately,
KNC Phis support this kind of asynchronous data transfer and kernel execution
under the offload model. This asyn-
chronous execution can be realized by coupling the signal and wait clauses,
both of which are associated with each other via a unique identifier. That is,
a signal clause initiates an asynchronous operation such as data transfer and
kernel execution, while a wait clause blocks the current execution until the as-
sociated asynchronous operation is completed. Refer to our previous work [28]
for more details about the host-side asynchronous execution workflow proposed.
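The following is a minimal, hypothetical sketch of this double-buffered signal/wait workflow using the offload pragmas; compute_pass and consume_on_host are placeholder names rather than functions of our library, and error handling is omitted.

```cpp
#include <cstdlib>

// Function compiled for the coprocessor (placeholder for one pass of the kernel).
__attribute__((target(mic))) void compute_pass(int pass, float *out, long len);
void consume_on_host(int pass, const float *res, long len);  // e.g. write results to disk

void run_passes(int num_passes, long pass_len) {
  float *buf0 = (float *)malloc(pass_len * sizeof(float));
  float *buf1 = (float *)malloc(pass_len * sizeof(float));
  char tag0, tag1;  // their addresses serve as signal/wait identifiers

  for (int p = 0; p < num_passes; ++p) {
    float *cur = (p & 1) ? buf1 : buf0;
    char *sig  = (p & 1) ? &tag1 : &tag0;
    // Launch pass p on the Phi and return immediately on the host.
    #pragma offload target(mic:0) out(cur : length(pass_len)) signal(sig)
    compute_pass(p, cur, pass_len);

    if (p > 0) {  // overlap: consume pass p-1 on the host while pass p runs on the Phi
      float *prev = (p & 1) ? buf0 : buf1;
      char *prev_sig = (p & 1) ? &tag0 : &tag1;
      #pragma offload_wait target(mic:0) wait(prev_sig)
      consume_on_host(p - 1, prev, pass_len);
    }
  }
  // Drain the final pass.
  char *last_sig = ((num_passes - 1) & 1) ? &tag1 : &tag0;
  #pragma offload_wait target(mic:0) wait(last_sig)
  consume_on_host(num_passes - 1, ((num_passes - 1) & 1) ? buf1 : buf0, pass_len);
  free(buf0); free(buf1);
}
```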
4.3.2. Workload balancing
For the variant using the naïve kernel, all jobs have the same amount of
computation (refer to Algorithm 1). Thus, given a fixed number of threads, we
evenly distribute jobs over the set of threads by assigning a thread to process
one tile at a time. In contrast, for the two variants using the sorting-enabled
kernels, different jobs may have different amounts of computation because of
the two rounds of sorting used in Step1 and Step3. Typically, an OpenMP
dynamic scheduling policy would be favored to address such workload irregularity,
albeit with relatively heavier workload distribution overhead than static
scheduling. However, in our tests we observed that dynamic scheduling produces
slightly inferior performance to static scheduling. In this regard, we
adopted static scheduling in our two sorting-enabled variants and assigned one
thread to process one tile at a time, the same as the naïve variant.
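A sketch of this tile-wise static scheduling is shown below; tile_coord implements the inverse mapping of Equations (13) and (14), and process_tile is a hypothetical routine that runs the pairwise kernel on every job inside one q × q tile.

```cpp
#include <cmath>
#include <omp.h>

// Inverse tile mapping (Equations (13) and (14)); rounding guards as in the
// earlier job_coord sketch could be added for robustness.
static void tile_coord(long w, long t, long &yq, long &xq) {
  yq = (long)std::ceil(w - 0.5 - std::sqrt((double)w * w + w + 0.25 - 2.0 * (t + 1)));
  long Fw = yq * (2 * w - yq + 1) / 2;   // Fw(yq)
  xq = t + yq - Fw;
}

void process_tile(long yq, long xq);  // assumed: computes all tau jobs of one tile

void all_pairs_tiled(long w /* tile matrix dimension */) {
  const long num_tiles = w * (w + 1) / 2;
  // One thread processes one tile at a time; static scheduling, as in the text.
  #pragma omp parallel for schedule(static)
  for (long t = 0; t < num_tiles; ++t) {
    long yq, xq;
    tile_coord(w, t, yq, xq);
    process_tile(yq, xq);
  }
}
```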
4.4. Distributed Computing
On KNC Phi clusters, we used the MPI offload model, which launches MPI
processes just as an ordinary CPU cluster does. In this model, one or more
Phis will be associated with a parental MPI process, which utilizes offload prag-
mas/directives to interact with the affiliated Phis, and inter-Phi communications
need to be explicitly managed by parental processes. In this sense, Phis may not
perceive the existence of remote inter-process communications. Our distributed
implementations require one-to-one correspondence between MPI processes and
Phis, and adopted a static workload distribution scheme based on tiled com-
putation. This static distribution is inspired by our practice in multithreading
on single Phis. In our implementation, given p processes, we evenly distribute
tiles onto the p processes with the i-th (0 ≤ i < p) process assigned to compute
the tiles whose identifiers are in [i × ⌈w(w+1)/(2p)⌉, (i+1) × ⌈w(w+1)/(2p)⌉). Within each
process, we adopted the asynchronous execution workflow as well.
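The per-process tile range can be computed as in the following sketch (process_tile is again hypothetical, and MPI_Init is assumed to have been called already).

```cpp
#include <algorithm>
#include <mpi.h>

void process_tile(long tile_id);  // assumed: offloads one tile's jobs to the local Phi

void distribute_tiles(long w) {
  int p, rank;
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const long total = w * (w + 1) / 2;                  // tiles in the upper triangle
  const long chunk = (total + p - 1) / p;              // ceil(w(w+1) / (2p))
  const long first = (long)rank * chunk;               // inclusive
  const long last  = std::min(first + chunk, total);   // exclusive

  for (long t = first; t < last; ++t)
    process_tile(t);
}
```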
5. Performance Evaluation
We assessed the performance of our algorithm from four aspects: (i) compar-
ison between our three variants, (ii) comparison with widely used counterparts:
MATLAB (version R2015b) and R (version 3.2.0), (iii) multithreading scalabil-
ity on a single Phi, and (iv) distributed computing scalability on Phi clusters.
In these tests, we used four real whole human genome gene expression datasets
(refer to Table 2) produced by Affymetrix Human Genome U133 Plus 2.0 Array.
These datasets are publicly available in the GPL570 data collection of SEEK
[1], a query-based computational gene co-expression search engine over large
transcriptomic databases. In this study, unless otherwise specified, we compute
τ coefficients between genes, meaning that m is equal to the number of genes
and n equal to the number of samples for each dataset.
All tests were conducted on 8 compute nodes of the CyEnce HPC Cluster (Iowa
State University), where each node has two high-end Intel E5-2650 8-core 2.0
GHz CPUs, two 5110P Phis (each has 60 cores and 8 GB memory) and 128 GB
memory. Our algorithm was compiled with the Intel C++ compiler v15.0.1 with the option
-fast enabled. In addition, for the distributed computing scalability assessment,
when two processes are launched into the same node, we used the environment
variable I_MPI_PIN_PROCESSOR_LIST to guide the Intel MPI runtime system to pin
the two processes within the node to distinct CPUs (recall that one node has
two CPUs).
Table 2: Information of whole human genome gene expression datasets used
Name No. of Genes No. of Samples Platform
DS21050 17,941 310 Affymetrix
DS19784 17,941 320 Affymetrix
DS13070 17,941 324 Affymetrix
DS3526 17,941 353 Affymetrix
5.1. Assessment of Our Three Variants
All three variants work on CPUs, Phis and their clusters. The
naïve and GSE variants both used the same C++ core code for CPU-
and MIC-oriented instances. For the VSE variant, its 512-bit SIMD vectoriza-
tion is only applicable to Phis. In this regard, to support CPUs, we further
developed a non-vectorized merge sort in Step1 and a non-vectorized discor-
dant pair counting algorithm in Step3 instead, based on the aforementioned
packed integer representation. Considering that this non-vectorized version also
works on Phis, we have examined how much our 512-bit SIMD vectorization
contributes to speed improvement by comparing the vectorized version with the
non-vectorized one on the same 5110P Phi. Interestingly, by measuring the
correlation between genes using the datasets in Table 2, we observed that the
non-vectorized version is on a par with the vectorized one. This may be due to
the relatively small value of n, which is only 353 at maximum. Based on this
consideration, we further evaluated both versions by measuring the correlation
between samples, where m will be equal to the number of samples and n equal to
the number of genes. In this case, n becomes 17,941 for each dataset, more than
50× larger than before. In this context, our performance assessment exposed
that the vectorized version is consistently superior to the non-vectorized one,
yielding very stable speedups averaged to be 1.87 for all datasets. This result
was exciting, proving that our 512-bit SIMD vectorization did boost speed even
for moderately large n values. Hence, we have used the vectorized version of
the VSE variant for performance measurement all throughout our study.
5.1.1. On CPU
We first compared the performance of our three variants on multiple CPU
cores. Table 3 shows the performance of each variant on 16 CPU cores and
Figure 3 shows the speedup of the 16-threaded instance of each variant over its
single-threaded one.
For the naïve variant, its 16-threaded instance achieves a roughly constant
speedup of 13.70 over its single-threaded one. This observation is consistent
with our expectation, as the runtime of the naïve kernel is subject to the vector
size n but independent of the actual vector content (refer to the implementation
shown in Algorithm 1). In contrast, the speed of the two sorting-enabled kernels
is sensitive to the vector content to some degree. This can be explained by the
following three factors: variable pair sorting in Step1, discordant pair counting
based on pairwise merge of sorted subarrays in Step3, and linear-time scans
for determining tie groups in Step2 and Step4.

Figure 3: Speedups of 16-threaded instances over single-threaded ones on multiple CPU cores

Nevertheless, for both sorting-
enabled variants, their 16-threaded instances yield relatively consistent speedups
over their single-threaded ones, respectively. For the GSE (VSE) variant, its 16-
threaded instance runs 13.74× (13.79×) faster than its single-threaded one on
average, with a minimum speedup 13.63 (13.69) and a maximum speedup 13.90
(13.97). For each case, the two sorting-enabled variants both yield superior
performance to the naïve variant, where the average speedup is 1.60 for the
GSE kernel and 2.72 for the VSE kernel.
Table 3: Performance comparison of the three variants on multiple CPU cores

Instance          Dataset    Time (s)                     Speedup
                             Naïve     GSE      VSE       GSE     VSE
Single-threaded   DS21050    10,052    6,609    3,758     1.52    2.67
                  DS19784    10,690    6,776    3,938     1.58    2.71
                  DS13070    10,975    6,920    4,041     1.59    2.72
                  DS3526     12,993    7,619    4,742     1.71    2.74
16-threaded       DS21050    734       475      275       1.54    2.67
                  DS19784    781       495      286       1.58    2.73
                  DS13070    801       504      294       1.59    2.72
                  DS3526     948       559      340       1.70    2.79
Table 4: Performance comparison of the three variants on Phi

Dataset    Time (s)                  Speedup
           Naïve    GSE     VSE      GSE     VSE
DS21050    400      419     261      0.95    1.53
DS19784    424      434     266      0.98    1.59
DS13070    433      459     276      0.94    1.57
DS3526     506      488     296      1.04    1.71
5.1.2. On Xeon Phi
Subsequently, we evaluated the three variants on a single 5110P Phi (see
Table 4). From Table 4, the VSE variant outperforms the naïve variant for each
dataset by yielding an average speedup of 1.60, with the minimum speedup 1.53
and the maximum speedup 1.71. The GSE variant is superior to the naïve one
on the DS3526 dataset, but inferior to the latter on the remaining three datasets.
Interestingly, by revisiting the CPU results shown in Table 3, we recalled that
the former actually performs consistently better than the latter for each dataset
on CPU. Since both the naïve and the GSE variants use the same pairwise τ
coefficient kernel code for their corresponding CPU- and MIC-oriented imple-
mentations, this discordant performance ranking between the two variants could
be attributed to the architectural differences between the two types of processing unit (PU).
For instance, by enabling auto-vectorization, the naïve kernel (see Algorithm
1) can concurrently process 16 integer elements by one 512-bit SIMD vector
instruction, in contrast with 4 integer elements by one 128-bit SSE instruction
on CPU. This fourfold increase in parallelism would enable the naïve kernel to
further boost performance.
Figure 4 shows the speedups of each variant on the Phi over multiple
CPU cores for each dataset. From Figure 4, it is observed that the Phi instance
of each variant outperforms its corresponding single-threaded and 16-threaded
CPU instances for each dataset. Specifically, the Phi instance of the naïve
variant runs 25.34× faster on average than its single-threaded CPU instance,
with the maximum speedup 25.67, and 1.85× faster on average than the 16-
threaded CPU instance, with the maximum speedup 1.87. Compared to their
single-threaded CPU instances, the Phi instances of the GSE and VSE variants
produce the average speedups of 15.52 and 14.96 and the maximum speedups
of 15.77 and 16.00, respectively. Meanwhile, in comparison to their 16-threaded
CPU instances, their Phi instances perform 1.13× and 1.08× better on average
and 1.15× and 1.15× better at maximum, respectively.
5.2. Comparison With MATLAB and R
Secondly, we compared our algorithm with all-pairs τ coefficient implemen-
tations in the widely used MATLAB (version R2015b) and R (version 3.2.0).
MATLAB and R were executed on one of the 16-core compute nodes mentioned above,
where MATLAB runs 16 threads as it supports multithreading and R runs
sequentially. For fair comparison, both MATLAB and R merely compute all-pairs
τ coefficient values, without conducting statistical tests, same as our algorithm
does.

Figure 4: Speedups of each variant on the Phi over the CPU: (a) speedups over one CPU core and (b) speedups over 16 CPU cores.

Table 5: Performance comparison with MATLAB and R

Dataset    Time (s)               Speedup over MLAB        Speedup over R
           MLAB       R           Naïve    GSE    VSE      Naïve    GSE    VSE
DS21050    182,343    266,001     456      435    699      665      635    1,019
DS19784    205,181    284,215     484      473    772      671      655    1,069
DS13070    205,063    287,635     473      446    743      664      626    1,042
DS3526     240,778    345,668     476      494    812      683      709    1,166

Table 5 shows the runtimes of the 16-threaded MATLAB and sequential
R instances as well as their comparison with our three variants that execute on
the 5110P Phi. From the table, each variant demonstrates excellent speedups,
i.e. two orders-of-magnitude over 16-threaded MATLAB and three orders-of-
magnitude over sequential R. Compared to 16-threaded MATLAB, the average
and maximum speedups are 472 and 484 for the naïve variant, 462 and 494 for
the GSE variant and 756 and 812 for the VSE variant, respectively. In contrast,
the average and maximum speedups over sequential R are 671 and 683 for the
naïve variant, 656 and 709 for the GSE variant and 1,074 and 1,166 for the VSE
variant, respectively.
5.3. Parallel Scalability Assessment
5.3.1. Multithreading
Thirdly, we evaluated the multithreading scalability of our variants on the
Phi with respect to the number of threads. Figure 5 shows the parallel scalability
of the three variants. In this test, we used the balanced thread affinity to ensure
that thread allocation is balanced over the cores and that the threads allocated to
the same core have consecutive identifiers (i.e. are neighbors of each other).
Figure 5: Multithreading scalability of different variants (runtime in seconds per dataset using 59, 118, 177 and 236 threads): (a) the naïve variant, (b) the GSE variant and (c) the VSE variant.
This thread affinity configuration is attained by setting the environment variable
KMP_AFFINITY to balanced.
From the figures, it is observed that each variant improves in performance
as the number of active threads per core grows from 1, through 2 and 3, to 4
(corresponding to 59, 118, 177 and 236 threads, respectively). As each core is
dual-issue and employs in-order execution, at least two threads per core are required
to saturate the compute capacity of each core. Accordingly, when moving
from one thread per core to two threads per core, we expect the speedup
to be close to 2. In this regard, we investigated the speedup of the instance
with two threads per core over the one with one thread per core for each variant,
and found that the average speedups are 1.58, 1.61 and 1.56 for the naïve,
GSE and VSE variants, respectively. This finding largely matches our expectation.
Furthermore, since all threads on a core share its dual in-order execution
pipeline, deploying three or four threads per core may further improve
performance, but normally with decreased parallel efficiency (with respect to
threads rather than cores). Note that for the 5110P Phi, only 59 out of 60 cores
are used for computing, as one core is reserved for the operating system running
on the device. Since each variant reaches peak performance at 236 threads, we used
this number of threads for performance evaluation in all tests, unless otherwise
stated.
5.3.2. Distributed computing
Finally, we measured the parallel scalability of our algorithm by varying the
number of Phis used in a distributed environment (see Figure 6). This test was
conducted on a cluster of 16 Phis hosted by the aforementioned 8 compute
nodes. Each variant used 236 threads, since this setting leads to the best
performance as mentioned above. From Figure 6, the naïve variant demonstrates
nearly constant speedups for all datasets on a given number of Phis, while
the two sorting-enabled variants exhibit slight fluctuations under the same
hardware configuration. This can be explained by the fact that the runtime of
our naïve variant is independent of vector content, whereas that of each
sorting-enabled variant is sensitive to the actual vector content to some degree.
Figure 6: Distributed computing scalability on 2, 4, 8 and 16 Phis: (a) the naïve variant, (b) the GSE variant, (c) the VSE variant and (d) runtimes (s) of each variant on the DS3526 dataset.
Moreover, for each dataset, every variant demonstrates rather good scalability
with respect to the number of Phis. Concretely, the naïve variant achieves
average speedups of 2.00, 3.99, 7.96 and 15.81, the GSE variant yields average
speedups of 1.93, 3.82, 7.63 and 14.97, and the VSE variant produces average
speedups of 1.99, 3.94, 7.86 and 15.23, on 2, 4, 8 and 16 Phis, respectively.
6. Conclusion
Pairwise association measurement is an important operation in searching for
meaningful insights within a dataset by examining potentially interesting
relationships between its variables. However, all-pairs association
computation is computationally demanding for a large volume of data and may
take a prohibitively long sequential runtime. This computational challenge for
big data motivated us to use parallel/high-performance computing architectures
to accelerate its execution.
In this paper, we have investigated the parallelization of all-pairs Kendall's τ
coefficient computation using MIC processors as the accelerator. To the best of
our knowledge, this is the first time that all-pairs τ coefficient computation
has been parallelized on MIC processors. For the τ coefficient, we have developed
three variants, namely the naïve variant, the GSE variant and the VSE variant,
based on three pairwise τ coefficient kernels, i.e. the naïve kernel, the GSE
kernel and the VSE kernel. All three variants can execute on CPUs, Phis and
their clusters. The naïve kernel has a time complexity of O(n^2), while the other
two sorting-enabled kernels have an improved time complexity of O(n log n).
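To give a flavor of where the O(n log n) bound comes from, the sketch below (a simplified, tie-free illustration in the spirit of the sorting-enabled kernels, not the GSE or VSE kernel itself; all names are assumptions) sorts the samples by one variable and then counts discordant pairs as inversions of the other variable during a merge sort:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Count inversions in a[lo, hi) with a merge sort; once the samples are sorted
// by the first variable, each inversion of the second variable is one discordant pair.
static int64_t sort_count(std::vector<int>& a, std::vector<int>& tmp, int lo, int hi) {
    if (hi - lo < 2) return 0;
    const int mid = lo + (hi - lo) / 2;
    int64_t inv = sort_count(a, tmp, lo, mid) + sort_count(a, tmp, mid, hi);
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) {
        if (a[i] <= a[j]) tmp[k++] = a[i++];
        else { inv += mid - i; tmp[k++] = a[j++]; }  // a[j] precedes all remaining left elements
    }
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    std::copy(tmp.begin() + lo, tmp.begin() + hi, a.begin() + lo);
    return inv;
}

// Tie-free Kendall's tau in O(n log n): tau = (C - D) / (n(n-1)/2) with C = total - D.
double sorted_tau(std::vector<std::pair<int, int>> xy) {
    std::sort(xy.begin(), xy.end());                 // sort by x (ties broken by y)
    const int n = static_cast<int>(xy.size());
    std::vector<int> y(n), tmp(n);
    for (int i = 0; i < n; ++i) y[i] = xy[i].second;
    const int64_t total = static_cast<int64_t>(n) * (n - 1) / 2;
    const int64_t discordant = sort_count(y, tmp, 0, n);
    return static_cast<double>(total - 2 * discordant) / static_cast<double>(total);
}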
Furthermore, we have proposed a generic framework for workload balancing
in symmetric all-pairs computation and for overcoming the memory limitation on the
host or accelerator side. This framework assigns unique identifiers to the jobs in the
upper triangle of the job matrix and builds provable bijective functions between
the job identifier and the coordinate space.
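For illustration, one standard bijection of this kind (shown here as a sketch; the concrete functions derived in the paper may differ in form) maps a job identifier k in [0, p(p−1)/2) to the coordinates (i, j), with i < j, of the k-th cell, in row-major order, of the strict upper triangle of a p×p job matrix, and back:

#include <cmath>
#include <cstdint>

// Row-major job identifier for cell (i, j), 0 <= i < j < p, of the strict
// upper triangle of a p-by-p job matrix.
int64_t job_id(int64_t i, int64_t j, int64_t p) {
    return i * p - i * (i + 1) / 2 + (j - i - 1);
}

// Inverse mapping from a job identifier k back to its coordinates (i, j).
void job_coords(int64_t k, int64_t p, int64_t& i, int64_t& j) {
    // Closed-form estimate of the row index, followed by an integer fix-up
    // to guard against floating-point rounding for large p.
    i = static_cast<int64_t>((2 * p - 1
        - std::sqrt(static_cast<double>((2 * p - 1) * (2 * p - 1) - 8 * k))) / 2);
    while (i * p - i * (i + 1) / 2 > k) --i;
    while ((i + 1) * p - (i + 1) * (i + 2) / 2 <= k) ++i;
    j = k - (i * p - i * (i + 1) / 2) + i + 1;
}

Such a mapping lets jobs be distributed by identifier ranges without materializing the full job matrix, which is one way a framework of this kind can balance workload and sidestep on-chip memory limits.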
The performance of the three variants was evaluated using a collection of
real gene expression datasets produced from whole human genomes. Our experimental
results demonstrated that on one 5110P Phi, the naïve variant runs up
to 25.67× faster than its execution on a single CPU core, and up to 1.87× faster
than on 16 CPU cores. On the same Phi, the GSE and VSE variants run up to
15.77× and 16.00× faster than their executions on a single CPU core, and up
to 1.15× and 1.15× faster than on 16 CPU cores, respectively. Meanwhile, on
the same CPU/Phi hardware configuration, the VSE variant achieves superior
performance to the naïve and GSE variants. Interestingly, the naïve variant was
observed to be inferior to the GSE variant on CPUs in both single-threaded
and 16-threaded settings, but became superior to the latter on the 5110P Phi
for most of the benchmarking datasets used. As both variants use the same core
C++ code for their corresponding CPU- and MIC-oriented implementations, as
mentioned above, this observation suggests that the architectural differences
between different types of PUs may have a substantial impact on the resulting
performance. Subsequently, we compared our algorithm with the third-party
counterparts implemented in the popular MATLAB and R, observing that our
algorithm on a single 5110P Phi runs up to 812× faster than 16-threaded MATLAB
and up to 1,166× faster than sequential R, both of which were executed
on high-end Intel E5-2650 CPUs. We further assessed the parallel scalability
of our algorithm with respect to the number of Phis in a distributed environment,
showing that it achieves rather good distributed computing scalability.
Finally, it is worth mentioning that the current version of our VSE kernel
constrains the largest n value to 2^15 − 1, due to 32-bit integer packing. One
solution to overcome this constraint is to pack a variable pair (ui, vi) into a
64-bit integer, with each variable occupying 32 bits. In this case, we can split
a 512-bit vector into 8 lanes, with each lane representing a 64-bit integer, and
then implement vectorized pairwise merging of sorted subarrays based on 64-bit
integers. Nonetheless, it should be noted that this 64-bit solution offers only
half the parallelism of the 32-bit scheme (a minimal sketch of this packing is
given after this paragraph). In the future, we plan to combine this non-linear measure
with other linear or non-linear correlation/dependence measures to generate
fused co-expression networks for genome-wide gene expression data analysis at
the population level. In addition, since the R programming language is frequently used
in data analytics and the R implementation of all-pairs τ coefficient computation
is extremely slow for large-scale datasets, we also plan to release an R package
for public use based on our research in this study.
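As a concrete illustration of the proposed workaround, the following sketch packs and unpacks such a 64-bit key, assuming unsigned 32-bit values and placing ui in the high half so that sorting by the packed key orders pairs primarily by the first variable; pack_pair and unpack_pair are hypothetical helper names rather than functions in the released code:

#include <cstdint>

// Pack a variable pair (u, v) into one 64-bit integer, with u occupying the
// high 32 bits and v the low 32 bits; eight such keys fill one 512-bit vector.
inline uint64_t pack_pair(uint32_t u, uint32_t v) {
    return (static_cast<uint64_t>(u) << 32) | v;
}

// Recover the original pair from a packed 64-bit key.
inline void unpack_pair(uint64_t key, uint32_t& u, uint32_t& v) {
    u = static_cast<uint32_t>(key >> 32);
    v = static_cast<uint32_t>(key & 0xFFFFFFFFu);
}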
Acknowledgment
This research is supported in part by the US National Science Foundation under
grant IIS-1416259 and an Intel Parallel Computing Center award. Conflict of interest:
none declared.