Co-expression network is a critical technique for the identification of inter-gene interactions, which usually relies on all-pairs correlation (or similar measure) computation between gene expression profiles across multiple samples. Pearson's correlation coefficient (PCC) is one widely used technique for gene co-expression network construction. However, allpairs PCC computation is computationally demanding for large numbers of gene expression profiles, thus motivating our acceleration of its execution using high-performance computing. In this paper, we present LightPCC, the first parallel and distributed all-pairs PCC computation on Intel Xeon Phi (Phi) clusters. It achieves high speed by exploring the SIMDinstruction-level and thread-level parallelism within Phis as well as accelerator-level parallelism among multiple Phis. To facilitate balanced workload distribution, we have proposed a general framework for symmetric all-pairs computation by building bijective functions between job identifier and coordinate space for the first time. We have evaluated LightPCC and compared it to two CPU-based counterparts: a sequential C++ implementation in ALGLIB and an implementation based on a parallel general matrix-matrix multiplication routine in Intel Math Kernel Library (MKL) (all use double precision), using a set of gene expression datasets. Performance evaluation revealed that with one 5110P Phi and 16 Phis, LightPCC runs up to 20.6× and 218.2× faster than ALGLIB, and up to 6.8× and 71.4× faster than single-threaded MKL, respectively. In addition, LightPCC demonstrated good parallel scalability in terms of number of Phis. Source code of LightPCC is publicly available at http://lightpcc.sourceforge.net.
I. INTRODUCTION
Co-expression networks have been frequently used to reverse engineer the whole-genome interactions between complex multicellular organisms by ascertaining common regulation and thus common functions. A gene co-expression network is represented as an undirected graph with nodes being genes and edges representing significant inter-gene interactions. Such a network can be constructed by computing linear (e.g. [1] ) or non-linear (e.g. [2] [3] [4] ) coexpression measures between paired gene expression profiles across multiple samples. As the first formal and wide-spread correlation measure [5] [6] , Pearson's correlation coefficient (PCC), or alias Pearson's r, is one widely used technique in co-expression network construction [7] . However, allpairs PCC computation of gene expression profiles is not computationally trivial for genome-wide association study with large number of gene expression profiles across a large population of samples, especially when coupled with permutation tests for statistical inference. The importance of all-pairs PCC computation and its considerable computing demand motivated us to investigate its acceleration on parallel/high-performance computing architectures.
PCC statistically measures the strength of linear association between pairs of continuous random variables, but does not apply to non-linear relationship. Thus, we must ensure the linearity between paired data prior to the application of PCC. Given two random variables u and v of l dimensions each, the PCC between them is defined as
In Equation (1), u[k] is the k-th element of u, whileū is the mean of u and equal to 1 l l−1 k=0 u [k] . Notations are likewise defined for v. Given a variable pair, the sequential implementation of Equation (1) has a linear time complexity proportional to l. Moreover, it is known that the absolute value of the nominator is always less than or equal to the denominator [5] . Thus, r(u, v) always varies in [−1, +1]. Concretely, r(u, v) = 0 indicates no linear relationship, > 0 positive association and < 0 negative association.
Although PCC is widely used in science and engineering, the acceleration of its computation using parallel/highperformance computing architectures has not yet been extensively investigated in the literature. Chang et al. [8] used the CUDA-enabled GPU to accelerate all-pairs computation of PCC and computed pairwise PCC using the following standard reformulation:
(2) This work was extended by [9] to support GPU clusters, which adopted a master-slave model to manage workload distribution over multiple GPUs. Wang et al. [10] adopted a hybrid CPU-GPU coprocessing model and reformulated each variable u to a new representation w, which is defined as
for each k (0 ≤ k < l) in order to employ general matrix-matrix multiplication (GEMM) parallelization that has been well studied in parallel computing. Note that due to the commutative nature of pairwise PCC computation, this GEMM approach will cause a waste of half horsepower. Similarly, Wang et al. [11] also employed a parallel GEMM approach to accelerate the all-pairs computation of a given dataset X, but on a single Xeon Phi (Phi). Note that our approach is different from [11] , because ours accelerates the overall computation over X on a cluster of Phis.
In this paper, we present LightPCC, the first parallel and distributed algorithm to harness Phi clusters to accelerate all-pairs PCC computation. To achieve high speed, our algorithm explores instruction-level parallelism within SIMD vector processing units per Phi, thread-level parallelism over many cores per Phi, and accelerator-level parallelism across a cluster of Phis. Moreover, we have investigated a general framework for symmetric all-pairs computation to facilitate balanced workload distribution within and between processing elements (PEs), by pioneering to build a reversible and bijective relationship between job identifier and coordinate space in a job matrix. Using both artificial and real gene expression datasets, we have compared LightPCC to two CPU-based counterparts: a sequential C++ implementation in ALGLIB (http://www.alglib.net) and an implementation based on a parallel GEMM routine in Intel Math Kernel Library (MKL). Our experimental results showed that by using one 5110P Phi and 16 Phis, LightPCC is able to run up to 20.6× and 218.2× faster than ALGLIB, and up to 6.8× and 71.4× faster than singled-threaded MKL, respectively. In addition, LightPCC demonstrated good parallel scalability with respect to varied number of Phis.
II. XEON PHI ARCHITECTURE
A Phi coprocessor is a many-core shared-memory computer [12] , which runs a specialized Linux operating system and provides full cache coherency over the entire chip. The Phi is comprised of a set of processor cores, and each core offers four-way simultaneous multithreading, i.e. 4 hardware threads per core. While offering scalar processing, each core also includes a newly-designed VPU which features a 512-bit wide SIMD instruction set architecture (ISA). Each vector register can be split to either 16 32-bit-wide lanes or 8 64-bit-wide lanes. The Phi does not provide support for legacy SIMD ISAs such as the SSE series. As for caching, each core locally has separate L1 instruction and data caches of size 32 KB each, and a 512 KB L2 cache. Moreover, all L2 caches across the entire chip are interconnected, through a bidirectional ring bus, to form a unified shared L2 cache of over 30 MB. In addition, there are two usage models for invoking Phis: offload model and native model. The offload model relies on compiler pragmas/directives to offload highly-parallel parts of an application to the Phi, while the native model treats a Phi as a symmetric multiprocessing computer. As of today, Phis have been used to accelerate important computational problems in diverse research fields such as bioinformatics [13] [14] and machine learning [15] [16] .
III. PARALLELIZED IMPLEMENTATION

A. Pearson's Correlation Coefficient Reformulation
In the case of all-pairs computation, significantly more computation can be further reduced than pure pairwise computation. For instance, given a l-dimensional variable u, the values of
could be repeatedly calculated up to n − 1 times in the case of literal computing using Equation (1) . Since these two values are only dependent on u, they can be computed once beforehand. We define X = {X 0 , X 1 , . . . , X n−1 } to denote a set of n l-dimensional variables and compute the new representation U i of X i as
In this way, the PCC between X i and X j is computed as
From Equation (5), we can see that if organizing all members of U to form a n × l matrix A with U i being row i of A, we can realize the all-pairs computation over U by multiplying matrix A by its transpose, i.e. R = A × A T via a GEMM algorithm. Note that because R is symmetric, direct application of a GEMM algorithm will cause a waste of half compute power as noted before. As mentioned above, Wang et al. [10] also proposed a reformulation in order to benefit from parallel GEMM algorithms (refer to Equation (3)). This reformulation computes the pairwise PCC between X i and X j as
Using this equation, though a GEMM algorithm can be used to compute the nominator, the denominator has to be additionally computed.
B. All-Pairs Computation Framework
We consider the n×n job matrix to be a 2-dimensional coordinate space on the Cartesian plane, and define the left-top corner to be the origin, the horizontal x-axis (corresponding to columns) in left-to-right direction and the vertical y-axis (corresponding to rows) in top-to-bottom direction.
1) Non-symmetric all-pairs computation: For nonsymmetric all-pairs computation (non-commutative pairwise computation), the workload distribution over PEs (e.g. threads, processes, cores and etc.) would be relatively straightforward. This is because coordinates in the 2dimensional matrix corresponds to distinct jobs. Specifically, given a coordinate (y, x) (0 ≤ x, y < n), we can compute its unique job identifier J n (y, x) ∈ [0, n 2 ) as
. Reversely, given a job identifier J n (y, x) ∈ [0, n 2 ), we can compute its unique coordinate as
in the job matrix.
2) Symmetric all-pairs computation: Unlike nonsymmetric all-pairs computation, it suffices by only computing the upper-triangle (or lower-triangle) of the job matrix for symmetric all-pairs computation (commutative pairwise computation). In this case, balanced workload distribution could be more complex than non-symmetric all-pairs computation. For workload distribution, one approach [17] is to allocate a separate job array, totally of n(n + 1)/2 elements if the major diagonal is counted in, with element i (0 ≤ i < n(n + 1)/2) storing the coordinate of the i-th job and then let each PE access this array to obtain the coordinates of the jobs assigned to it. The major drawback of this approach is the extra memory consumed by the job array, since the memory overhead can be huge for large n. Another approach [10] is to use a policy designed for parallel matrix-matrix multiplication by discarding the redundant computing part, but could incur unbalanced workload distribution. In addition, some approaches (e.g. [9] ) use master-slave computing model.
In this paper, we propose a general framework for workload balancing in symmetric all-pairs computation. This framework works by assigning each job a unique identifier and then building a bijective relationship between a job identifier J n (y, x) and its corresponding coordinate (y, x). This mapping is called direct bijective mapping in our context. While facilitating balanced workload distribution, this mapping merely relies on bijective functions, which is a prominent feature distinguished from existing methods. To the best of our knowledge, in the literature bijective functions have not ever been proposed for workload balancing in symmetric all-pairs computation. In [18] , the authors used a very similar job numbering approach to ours in this study, but did not derive a bijective function for symmetric all-pairs computation. Our framework can be applied to cases with identical (e.g. our study) or varied workload per job (e.g. using a shared integer counter to realize dynamic workload distribution via remote memory access operations in MPI and Unified Parallel C (UPC) programming models) and is also particularly useful for parallel computing architectures with hardware schedulers such as GPUs and FPGAs. In the following, without loss of generality, we will interpret our framework relative to the upper triangle of the job matrix by counting in the major diagonal. Nonetheless, this framework can be easily adapted to the cases excluding the major diagonal.
3) Direct bijective mapping: Given a job (y, x) in the upper triangle, we compute its integer job identifier J n (y, x) as
for n variables. In this equation, F n (y) is the total number of cells preceding row y in the upper triangle and is computed as
where y varies in [0, n] and there are two boundary cases needing to be paid attention to: one is when y = 0 and the other is when y = n. When y = 0, F n (0) = 0 because no cell in the upper triangle appears before row 0; and when y = n, F n (n) = n(n + 1)/2 because all cells in the upper triangle are included. In this way, we have defined Equation (9) based on our job numbering policy, i.e. all job identifiers vary in [0, n(n + 1)/2) and jobs are sequentially numbered left-to-right and top-to-bottom in the upper triangle (see Fig.  1 for an example). Reversely, given a job identifier J = J n (y, x) (0 ≤ J < n(n + 1)/2), we need to compute the coordinate (y, x) in order to locate the corresponding variable pair. As per our definition, we have
It needs to be stressed that there is surely an integer value y satisfying these two inequalities based our job numbering policy mentioned above. By solving J ≥ F n (y), we get
This is because (i) n 2 + n + 0.25 > 2J and thus J = F n (y) has two distinct y solutions theoretically, and (ii) all 0 ≤ y < n values are to the left of the symmetric axis y = n + 0.5, meaning strictly monotonically decreasing as a function of y. Meanwhile, by solving J ≤ F n (y + 1) − 1, we get
Similarly, this is because (i) n 2 + n + 0.25 > 2(J + 1) and thus J = F n (y + 1) − 1 has two distinct y solutions 
, we can reformulate Equations (12) and (13) 
and thereby have z ≤ y < z + 1. As mentioned above, as a function of integer y, Equation (11) definitely has y solutions as per our definition, meaning that at least one integer exists in [z, z + 1 + Δ − Δ ], which satisfies Equation (11) . Meanwhile, it is known that there always exists one and only one integer in [z, z + 1) (can be easily proved) and this integer equals z , regardless of the value of z. Since [z, z + 1 + Δ − Δ ] is a sub-range of [z, z + 1), we can conclude that Equation (11) has a unique solution y that is computed as
Having got y, we can compute the coordinate x as
based on Equation (9). Besides this theoretical proof, we also wrote a computer program to test its correctness.
C. Tiled Computation on Xeon Phi
Tiled computation is a frequently used technique in various applications accelerated by accelerators such as Cell/BEs [19] , GPUs [20] and Phis [13] . This technique partitions a matrix into a non-overlapping set of equal-sized t × t tiles. In our case, we partition the job matrix and produce a tile matrix of size m × m tiles, where m equals n/t . In this way, all jobs in the upper triangle of the job matrix are still fully covered by the upper triangle of the tile matrix. By treating a tile as a unit, we can assign a unique identifier to each tile in the upper triangle of the tile matrix and then build bijective functions between tile identifiers and tile coordinates in the tile matrix, similarly as we do for the job matrix.
1) Computing tile coordinates: As mentioned above, we have proposed a bijective mapping between job identifier and coordinate space. Because the tile matrix has an identical structure with the original job matrix, we can directly apply our aforementioned bijective mapping to the tile matrix. In this case, given a coordinate (y t , x t ) (0 ≤ y t ≤ x t < m) in the upper triangle of the tile matrix, we can compute a unique tile identifier J m (y t , x t ) as
Likewise, given a tile identifier J t (0 ≤ J t < m(m + 1)/2), we can reversely compute its unique vertical coordinate y t as
and subsequently its unique horizontal coordinate x t as
2) Multithreaded implementation: Having got the coordinate (y t , x t ) of a tile, we can determine the coordinate range of all jobs per tile, relative to the original job matrix. More specifically, the vertical coordinate y lies in [y t × t, (y t + 1) × t) and the horizontal coordinate x in [x t × t, (x t + 1) × t). Consequently, the computation of a tile can be completed by looping over the coordinate ranges of both y and x. Note that the jobs whose coordinate y > x do not need to be computed since they lie beyond the upper triangle of the job matrix.
For all-pairs PCC computation, every job has the same amount of computation. In this case, the ideal load balancing policy is supposed to distributing identical number of jobs onto each PE. Herein, a thread on the Phi is referred to as a PE. From section II, we know that each core on the Phi has four hardware threads and a two-level private L1/L2 cache hierarchy with caches being connected via a bidirectional ring bus to provide coherent caching on the entire chip. This non-uniform cache access reminds us that we should keep the active hardware threads per core sharing as much data as possible with the intention to improve caching performance. In these regards, our tiled computation schedules a tile to t threads and lets these t threads to compute the tile in parallel. Note that we must guarantee that the number of threads is a multiple of t within the parallel region of our Phi kernel. Algorithm 1 shows the pseudocode of the Phi kernel of our tiled computation. Due to the limited amount of device memory, we are not able to entirely reside the resulting n × n correlation matrix R in the Phi for large n values. To address this problem, we partition the title identifier range [0, m(m + 1)/2) into a set of equal-sized non-overlapping sub-ranges and adopt a multi-pass execution model by letting one pass compute one sub-range, denoted by [J start , J end ) for simplicity. To match this execution model, a result buffer R of size (J end − J start ) × t 2 elements is allocated on the Phi and used to store the results of the J end − J start tiles. For each tile, its t 2 results are consecutively placed in R . Once a pass finishes, we transfer R to the host, then extract the results per tile and finally store them in the resulting matrix R allocated on the host.
To benefit from the wide 512-bit SIMD vector instructions, we have aligned each l-dimensional variable in U to 64-byte memory boundary on the Phi. In Algorithm 1, compiler directives are used to hint to the compiler to autovectorize the inner-most loop (lines 18 and 20 in Algorithm 1). Alternatively, we also manually vectorized the loop using SIMD instrinsic functions. The SIMD instrinsic functions used are mm512 setzero ps/pd, mm512 load ps/pd, mm512 fmadd ps/pd, mm512 mask3 fmadd ps/pd, and mm512 reduce add ps/pd for single/double precision floating point. Interestingly, our manual-vectorization did not demonstrate obvious/significant performance advantage to auto-vectorization through our evaluations. More specifically, our manual-vectorization does run faster than autovectorization, but only by a tiny margin, on a single Phi. Considering that auto-vectorization is more portable than hard-coded SIMD intrinsic functions, we have used autovectorization all through our implementations.
3) Asynchronous kernel execution: As mentioned above, we rely on multiple passes of kernel execution to complete all-pairs computation. Conventionally, having completed one pass, we transfer the newly computed results to the host, and do not initiate a new kernel execution until having completed the processing of the new results. In this way, the co-processor will be kept idle, while we transfer and process the results on the host side. A better solution would be to employ asynchronous kernel execution, which enables concurrent execution of host-side tasks and deviceside kernel execution. Fortunately, the offload model provides the signal and wait clauses to support for asynchronous data transfer and kernel execution. More specifically, the signal clause enables asynchronous data transfer in #pragma offload_transfer directives and asynchronous computation in #pragma offload directives. The wait clause blocks the current execution until an asynchronous data transfer or computation has completed. Note that the signal and wait clauses are associated with each other via a unique value. In our implementation, we have used a double-buffering approach to facilitate asyn- 
12:
if (x < n) then
13:
for (y = yt × t; y < min{n, (yt + 1) × t}; + + y) do 14: 
27:
} //parallel region 28: end procedure chronous computation. Algorithm 2 gives the pseudocode of our asynchronous implementation.
D. Distributed Computing
On Phi clusters, two distributed computing models can be used to develop parallel and distributed algorithms. One model is MPI offload model, which launches MPI processes just as an ordinary CPU cluster does. The difference is that one or more Phi coprocessors will be associated to a parental MPI process and this parental process will utilize offload pragmas/directives to interact with the affiliated Phis. In this model, communications between Phis have to be explicitly managed by their parental processes and it is not a necessity for Phis to be aware of the existence of remote communications between MPI processes. The other model is symmetric model, which treats a Phi as a regular computer interconnected to form a compute cluster. One advantage of symmetric model to MPI offload model is that symmetric model allows for the execution of existing MPI programs designed for CPU clusters to be directly executed on Phi clusters, with no need of re-programming the code. Nonetheless, considering different architectural features between CPUs and Phis, some amount of efforts may have to be devoted to performance tuning on Phi clusters.
In LightPCC, we used MPI offload model with the requirement of one-to-one correspondence between MPI processes and Phis. This pairing is straightforward for the cases launching one MPI process into one node. However, it would become more complex when a node has multiple Phis available and multiple processes running. This is because 
21:
Transfer num elements in R out from device to host
22:
Process the newly computed results on the host 23:
end while Process the results of the completed kernel 24 :
25:
if (num > 0) then
26:
27:
Process the newly computed results on the host 28:
end if 29: end procedure multiple processes launched into the same node have no idea about which Phi should be associated to themselves.
To address this problem, we have used the registrationbased management mechanism used by [13] for paring MPI processing and Phis. Our distributed implementation is also based on tiled computation on the Phi. Given p MPI processes, we evenly distribute tiles onto the p processes with the i-th (0 ≤ i < p) process assigned to compute the tiles whose identifiers are in [i × m(m + 1)/2p , (i + 1) × m(m + 1)/2p ). Within each process, we adopt the same asynchronous control workflow with the computation for single Phis (see Algorithm 2) and execute the same tiled computation kernel on the affiliated Phi in each pass (see Algorithm 1) . Note that the initialization of variables J start and J stop (lines 2∼3 in Algorithm 2) must be changed accordingly for each process. Concretely, p i should initialize J start to be i × m(m + 1)/2p , and J stop to be (i + 1) × m(m + 1)/2p .
E. Variable Transformation on Xeon Phi
As mentioned above, we reformulate the computation of PCC by transforming each original variable X i to a new representation U i based on Equation (4). This variable transformation for input set X only needs to be done once beforehand, and is also embarrassingly parallel since the transformation per variable is mutually independent. On the other hand, each variable requires identical amount of computation since these variables have the same dimension.
In these regards, we parallelize the variable transformation by evenly distributing variables onto all threads on the Phi. Algorithm 3 gives the pseudocode of the Phi kernel of our variable transformation. In Algorithm 3, for each variable X i the transformation consists of three steps. Step 1 (lines 7∼13) computes the mean of all elements in X i and requires l unit arithmetic operations. Step 2 (lines 14∼20) computes the variance of all elements and takes 2l unit arithmetic operations if considering a fused multiply-add operation as a unit one. Step 3 (lines 21∼25) finishes the transformation of X i to U i and also needs 2l unit arithmetic operations. Therefore, the total computational cost of variable transformation can be estimated as 5l unit arithmetic operations. On the other hand, for symmetric all-pairs computation using Equation (5), its computational cost can be estimated to be ln(n + 1)/2 unit arithmetic operations. Consequently, the overall computational cost of our method can be estimated to be 5ln + ln(n + 1)/2 unit arithmetic operations. 
IV. PERFORMANCE EVALUATION
We evaluated LightPCC from three perspectives: (i) performance comparison with the sequential ALGLIB (version 3.10.0), (ii) performance comparison with an implementation based on the cblas_dgemm GEMM routine in MKL, which first transforms variables based on Equation (4) (refer to Algorithm 3) and then applies GEMM based on Equation (5), and (iii) parallel scalability evaluation on a Phi cluster, using a set of artificial and real gene expression datasets. All tests are conducted on 8 compute nodes in CyEnce HPC Cluster (Iowa State University), where each node has two Intel E5-2650 8-core 2.0 GHz CPUs, two 5110P Phis (each has 60 cores and 8 GB memory) and 128 GB memory. Every program is compiled by Intel C++ compiler v15.0.1 with option -fast enabled. Meanwhile, when two processes run in a node, we used the environment variable I_MPI_PIN_PROCESSOR_LIST to guide Intel MPI runtime system to pin two processes per node to distinct CPUs (recall that a node has two CPUs).
For LightPCC, we set the tile size to 4×4 (i.e. t = 4) and configure each core to run four hardware threads associated with the compact OpenMP thread affinity mode (i.e. 236 actives threads on 59 cores). In this implementation, we schedule one tile to a core at a time, and let all four threads per core compute the same tile in parallel, with one thread processing one column. In this way, the four threads per core will access the same row variable, thereby improving data sharing. However, even though a 5110P Phi executes four hardware threads per core in order [12] , the four threads on each core are actually scheduled individually and independently by the operating system. Hence, we cannot guarantee that the four hardware threads per core always work on the same tile at any instant. In this regard, we have introduced a software-based centralized barrier [21] , which is implemented using the atomic intrinsic function __sync_fetch_and_sub, in order to synchronize the four hardware threads per core each time they finish their computation on a tile (note that cores are configured to have independent software barriers). Unfortunately, we observed slight performance decrease after using software barriers through our evaluations. In this regard, we have decided not to use software barriers both in our implementation and following tests. In addition, for each evaluated program, we used double-precision floating point for fair comparison and averaged its five runs to get the runtime.
A. Evaluation on Artificial Gene Expression Data
We first evaluated the performance of LightPCC, ALGLIB and the implementation using MKL (refer to as MKL for short in the following) using three artificial gene expression datasets by randomly generating gene expression values in [0, 1]. This is reasonable because the runtime of PCC computation is merely subject to n and l and independent of specific values. They are randomly generated by setting n to 16,000 (16K), 32,000 (32K) or 64,000 (64K) and l to 5,000 (5K). Table I shows the performance comparison between Light-PCC and ALGLIB. Compared to ALGLIB, LightPCC runs 12.9×, 17.4× and 20.6× faster using one Phi and 160.1×, 190.2×, and 218.2× faster using 16 Phis for n=16K, n=32K and n=64K, respectively. Moreover, the speedup gradually increases as n grows larger. It is worth mentioning that many applications require determining the statistical significance of pairwise correlation. For this purpose, permutation test is a frequently used approach for statistical inference. However, this approach needs to repeatedly permute vector variables at random and compute pairwise correlation from the random data, where the more iterations (typically ≥1,000 iterations) are conducted, the more precise statistical results (e.g. Pvalue) can be expected (except for the cases of complete permutation tests that rarely happen). In this case, the runtime with a specified number of permutation tests can be roughly inferred from the runtime per iteration and the number of iterations conducted. 16 Phis to the 64K dataset. Moreover, for both comparisons, the speedups of LightPCC over MKL are observed to increase as n becomes larger. In addition, we evaluated MKL on a single Phi by only using the smallest 16K dataset, because a Phi does not own enough device memory to have the correlation matrix R reside entirely in memory. This is also one reason why we adopted the aforementioned multi-pass solution with asynchronous kernel execution. Through our test, the Phi-based MKL took 8.8 seconds to finish the computation, outperforming our LightPCC by a factor of 3.11 on a single Phi. Nevertheless, it should be noted that our algorithm is designed to enable fast processing of very large datasets by overcoming single Phi device memory limitation and leveraging Phi clusters. In contrast, these features cannot be easily realized by MKL.
As for parallel scalibility with respect to varied number of Phis (see Fig. 2a ), LightPCC achieves an average speedup of 1.6 by using 2 Phis, 3.0 by using 4 Phis, 6.0 by using 8 Phis and 11.3 by using 16 Phis, compared to the execution on a single Phi. Accordingly, the maximum speedup is 1.8, 3.5, 7.0 and 12.4, respectively. 
B. Evaluation on Whole Human Genome Expression Data
We further used a real whole human genome expression dataset to conduct performance comparison. This real dataset is obtained from SEEK [22] , a computational gene coexpression search engine that supports queries against very large transcriptomic data collections and also publicizes thousands of human datasets from diverse microarray and high-throughput sequencing platforms (available at http:// seek.princeton.edu). This dataset consists of 17,555 genes of 5,072 samples each and is extracted from the GPL570 gene expression data collection produced by Affymetrix Human Genome U133 Plus 2.0 Array.
On this real dataset, compared to ALGLIB, LightPCC runs 13.7× faster on a single Phi, 21.6× faster on 2 Phis, 44.0× faster on 4 Phis, 85.4× faster on 8 Phis, and 162.8× faster on 16 Phis (see Table III ). In comparison to singlethreaded MKL, LightPCC runs 4.7× faster on a single Phi and 55.6× faster on 16 Phis. When it comes to 16threaded MKL, LightPCC is not able to outperform the former until ≥ 4 Phis are used, similar to the assessment using artificial datasets. For this case, our algorithm reaches a maximum speedup of 4.3 with 16 Phis. Furthermore, we evaluated MKL on a single Phi as well. The experimental result showed that it took 11.6 seconds for Phi-based MKL to finish the computation, resulting in a speedup of 2.77 over our algorithm on one Phi. As for parallel scalability, LightPCC also demonstrates good performance (refer to Figure 2b) , where compared to the execution on one Phi, the speedup is 1.6 on 2 Phis, 3.2 on 4 Phis, 6.2 on 8 Phis and 11.9 on 16 Phis, respectively.
V. CONCLUSION
PCC is a correlation measure investigating linear relationship between continuous random variables and has been widely used in Bioinformatics. For instance, one popular application is to compute pairwise correlation between gene expression profiles and then build a gene co-expression network to identify common regulation and thus common functions. In addition, PCC can be applied to some computational problems (e.g. feature selection [6] and correlation clustering [23] ) in machine learning as well.
In this paper, we have presented LightPCC, the first parallel and distributed all-pairs PCC computation algorithm on Phi clusters. It is written in C++ template classes and harnesses three levels of parallelism (i.e. SIMD-instructionlevel parallelism, thread-level parallelism and acceleratorlevel parallelism) to achieve high performance. Furthermore, we have proposed a general framework for workload balancing in symmetric all-pairs computation. This framework assigns unique identifiers to jobs in the upper triangle of the job matrix and builds bijective functions between job identifier and coordinate space.
We have evaluated the performance of LightPCC using a set of gene expression profiles and further compared it to a sequential C++ implementation in ALGLIB and an implementation using the cblas_dgemm GEMM routine in MKL, both of which run on the CPU. Performance evaluation showed that LightPCC runs up to 20.6× faster than ALGLIB on one 5110P Phi and up to 218.2× faster on 16 Phis, with a corresponding speedup of up to 6.8 on one Phi and up to 71.4 on 16 Phis over single-threaded MKL. Besides, LightPCC yielded good parallel scalability with respect to varied number of Phis. As part of our future work, we plan to apply this work to construct genome-wide gene expression network (e.g. from conventional microarray data [24] , emerging RNA-seq data [25] [26] or diverse genomic data [27] ) and integrate it with statistical and graph analysis methods to identify critical pathways. In addition, our current implementation does not distribute PCC computation onto CPU cores. Therefore, we expect to further boost its performance by employing an alternative CPU-Phi coprocessing model that enables concurrent workload distribution onto both CPUs and Phis.
