De novo assembly is a widely used methodology in bioinformatics. However, conventional short-read-based de novo assembly is incapable of reliably reconstructing the large-scale structures of human genomes. Recently, a novel optical label-based technology has enabled reliable large-scale de novo assembly. Despite its advantage in large-scale genome analysis, this new technology requires a more computationally intensive alignment algorithm than its conventional counterpart. For example, the runtime of reconstructing a human genome is on the order of 10,000 hours on a sequential CPU. Therefore, in order to practically apply this new technology in genome research, accelerated approaches are desirable. In this article, we present three different accelerated approaches: multicore CPU, GPU, and FPGA. Against the sequential software baseline, our multicore CPU design achieved an 8.4× speedup, while the GPU and FPGA designs achieved 13.6× and 115× speedups, respectively. We also discuss the details of the design-space exploration of this new assembly algorithm on these three different devices. Finally, we compare these devices in performance, optimization techniques, price, and design effort.
INTRODUCTION
The ability to construct de novo assemblies is widely pursued for medical and research purposes. These de novo assemblies are especially invaluable in the studies of structural variations of genomes [Birol et al. 2009]. However, the conventional short-read technology-based de novo assemblies provide structural information only on a microscale (<1,000 bases per fragment). They are not capable of reconstructing the large-scale structures of human genomes [Hastie et al. 2013]. This is due to the fact that using the short-read-based assembly leads to ambiguity when these large-scale (>100,000 bases per fragment) genomes have frequent structural repetitions (typical medium to large genomes contain 40% to 85% repetitive sequences [Zuccolo et al. 2007]).
In recent years, research has shown that a novel optical label-based technology is able to overcome this limitation of the short-read technology. This novel technology fluorescently labels the DNA molecule strings at the locations where a specific nucleobase combination appears (e.g., label wherever the combination GCTCTTC appears, as demonstrated in Figure 1(a)). Then the labeled DNA molecules are linearized by being passed through a nanochannel device. These linearized strings with labels are imaged by a CCD camera, as demonstrated in Figure 1(b). In the image field, on each string, the physical distances between every two labels are measured and collected. This process results in a uniquely identifiable, sequence-specific pattern of labels to be used for de novo assembly. As opposed to sequences over the four-letter alphabet, these arbitrary physical distances are unlikely to contain structural repetitions. Therefore, this optical method enables the reconstruction of large-scale genomes for modern bioinformatic studies. In genomic studies, N50 is a widely used metric to measure the ability of a technology to assemble large-scale structures. Research results show that this novel optical assembly enhances the N50 by two orders of magnitude compared to the short-read assembly [Hastie et al. 2013].
The task of de novo assembly is to reconstruct the genome from a set of DNA fragments. The most computationally intensive part of this task is the algorithm that aligns every pair from the DNA fragment set. This pairwise alignment algorithm for the optical assembly is fundamentally different from the short-read alignment. In the conventional short-read-based process, as depicted in Figure 2(a), the alignment algorithm is applied on strings of the "A," "C," "G," and "T" DNA nucleobase letters. As opposed to the short-read letters, the new optical method aligns the locations of the fluorescent labels on the strings, as shown in Figure 2(b). Aligning these arbitrary numbers obtained from a human genome takes nearly 10,000 hours on a sequential CPU. Moreover, research [Baday et al. 2012] has shown that the resolution of the optical label method can be further enhanced by adding multiple types (colors) of labels.
Therefore, accelerating this alignment algorithm is desired not only to shorten the processing time but also to enable this optical-based technology in genome studies that require high resolutions. In this article, we present three accelerated approaches for the optical labeled DNA fragment alignment using a multicore CPU, a GPU, and an FPGA. These designs are compared against a single-thread sequential CPU implementation. The major contributions are as follows:
-The first attempt to accelerate large-scale genome assembly on hardware
-An end-to-end FPGA accelerated implementation
-A GPU accelerated implementation
-A comparison and design-space exploration of the multicore CPU, GPU, and FPGA

This article is an extension of our previous publication [Meng et al. 2014]. In particular, we extend the methodology of the FPGA design-space exploration and the hardware comparison. We also investigate the relationship between the accelerator performance and the number of alignments per host-device data transfer.
The rest of the article is organized as follows. We discuss related work in Section 2. We describe the alignment algorithm in Section 3. This is followed by descriptions of the accelerated designs in Section 4. Experimental performance results are provided in Section 5. We compare the hardware accelerators in Section 6. We conclude in Section 7.
RELATED WORK
Multiple accelerated approaches for short-read assembly have been proposed in recent years. A multi-FPGA accelerated genome assembly for short-reads was proposed in Olson et al. [2012] . They accelerated the alignment algorithm on the FPGAs for the reference-guided genome assembly with 250× and 31× speedups reported against the software implementations BFAST and Bowtie, respectively. An FPGA accelerated de novo assembly for short-reads was presented in Varma et al. [2013] . They chose to accelerate a preprocessing algorithm on the FPGA to reduce the short-read data for the CPU assembly algorithm. They reported a 13× speedup over the software. They also proposed an improved FPGA implementation exploiting the hard embedded blocks such as BRAMs and DSPs in Varma et al. [2014] . Attempts have also been made to accelerate genome assembly on GPUs. A GPU accelerated approach for short-read assembly was proposed in Aji et al. [2010] . They reported a 9.6× speedup. A GPU accelerated DNA assembly tool, SOAP3, was proposed in Liu et al. [2012] , which achieves a 20× speedup over Bowtie.
Although these approaches have improved the performance of short-read assembly significantly, they are limited to microscale genomes. There is still no high-performance solution for large-scale genome structure analysis. Our implementations provide an accelerated solution for this large-scale genome task.
Our implementations are fundamentally different from these previous efforts because they employ the novel optical label-based genome assembly. In this article, our accelerated designs differ from the previous short-read approaches in two ways: (1) the data in the optical method requires more precision bits than the conventional four letters (A, C, G, T) do, and (2) the physical label locations require a different alignment algorithm [Valouev 2006] from the traditional Smith-Waterman algorithm. Most short-read methods employ the traditional Smith-Waterman algorithm, which computes each score matrix element from its three immediately adjacent elements. The algorithm in our optical label-based method computes each element from a 4 × 4 area, as demonstrated in Figure 5. These differences not only increase the computational intensity but also require a different hardware parallelization strategy from the ones proposed in these previous short-read-based works. To the best of our knowledge, our implementations are the first attempt to accelerate large-scale genome assembly using GPUs and FPGAs.
ALIGNMENT ALGORITHM
Our goal is to align every pair of floating-point number arrays that represent the physical distances between the optical labels on the DNA fragments. As shown in Figure 3, for arrays X and Y, we decide whether the alignment (X_j aligned to Y_i) is valid based on three evaluations. First, as shown in Figure 3(a), we need to evaluate the similarity between X_j and Y_i, which is intuitive. Second, as depicted in Figure 3(b), we need to evaluate the boundary offset penalty: when X_j is aligned to Y_i, the leftmost ends of X and Y are generally offset from each other, and this offset is penalized.
These intuitive evaluations of the alignment are realized by a dynamic programming method specifically modified for optical DNA analysis by Valouev [2006]. The overall flow diagram of the algorithm is shown in Figure 4. The algorithm aligns two arrays X and Y of optical label positions by computing a likelihood score matrix and finding the maximum within this matrix. Each score represents the likelihood of a possible alignment between X and Y. Assuming the sizes of the input arrays are M and N, the algorithm computes an M × N score matrix as depicted in Figure 5. The computation of each element in the matrix requires local scores. The black square in the figure shows an example of a local score. Those elements near the edges, shown as the gray regions in the figure, also require boundary scores. Thus, the alignment algorithm consists of three steps: (1) compute the boundary scores as described in Algorithm 1, (2) compute the local scores as described in Algorithm 2, and (3) find the best score and its corresponding (i, j) in the score matrix as shown in lines 10 to 12 of Algorithm 2. If the best score passes the threshold, then we find an alignment between X and Y with X_j aligned to Y_i using a trace-back operation. In our hardware accelerated approaches, we keep the trace-back operation on the host PC. We therefore only describe the best score computation in detail as follows.
Fig. 4. Overall algorithm and its hardware partitioning. We accelerate the local score stage due to its computational intensity. In our partitioning, we also assign the boundary score and max score search stages to the accelerator to avoid intensive device-PC data communication.
The computation of the boundary scores is described in Algorithm 1. In the algorithm, to compute a boundary score element located at (i, j), we first compute its leftmost offset Lx_{i,j} or Ly_{i,j} as shown in lines 12 to 20. Then we compute an "end" likelihood and several mixed likelihoods as shown in lines 21 to 30. We choose the maximum among these likelihoods to be the boundary score for this position. This process is iterated, as shown in lines 4 and 11, to produce the boundary scores for the top four rows and the leftmost four columns of the score matrix. An identical boundary score algorithm is also applied on the rightmost offsets of the input arrays to fill the bottom four rows and the rightmost four columns of the score matrix. These boundary score locations are visualized in Figure 5.
We compute a local score to represent the similarity between X_j and Y_i as well as the similarities between X_j's neighbors and Y_i's neighbors. In Algorithm 2, to compute each local score score_{i,j}, we generate 16 score candidates corresponding to its upper-left 4 × 4 neighbors (refer to Algorithm 2, lines 5-7).
Each of the 16 candidates is computed by adding a local likelihood (this represents the similarity between X_j and Y_i) to its corresponding previous score (this represents the similarity between X_j's neighbors and Y_i's neighbors) from the 4 × 4 area (the shaded area in Figure 5). The score score_{i,j} is then updated with the maximum among all these 4 × 4 candidates. This process is iterated M × N times to generate the complete score matrix, as shown in lines 3 and 4. Then we find the highest score within the matrix (lines 10-12), which represents the best alignment for X and Y. This highest score is used in the post-processing to complete the genome reconstruction.
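As an illustration, the following C++ sketch captures this dynamic-programming core. The likelihood_local() placeholder merely stands in for the Likelihood_local function of Equation (3), and the argument convention is a simplified reading of Algorithm 2, not the exact code:

#include <algorithm>
#include <cmath>
#include <vector>

// Placeholder only: stands in for Likelihood_local of Equation (3).
static float likelihood_local(float x, float y, int m, int n) {
  (void)m; (void)n;
  return -std::fabs(x - y);  // illustrative, NOT the paper's likelihood model
}

// Simplified sketch of Algorithm 2: fill the interior of the M x N score
// matrix and track the best score. The boundary rows/columns of 'score' are
// assumed to have been filled by Algorithm 1 beforehand.
void local_scores(const std::vector<float>& X, const std::vector<float>& Y,
                  std::vector<std::vector<float>>& score,
                  float& score_best, int& i_best, int& j_best) {
  const int N = static_cast<int>(X.size());
  const int M = static_cast<int>(Y.size());
  score_best = -1e30f; i_best = 0; j_best = 0;
  for (int i = 4; i < M; ++i) {
    for (int j = 4; j < N; ++j) {
      float best = -1e30f;
      for (int g = 1; g <= 4; ++g) {      // 4 x 4 upper-left neighborhood
        for (int h = 1; h <= 4; ++h) {
          best = std::max(best, score[i - g][j - h] +
              likelihood_local(X[j] - X[j - h], Y[i] - Y[i - g], h, g));
        }
      }
      score[i][j] = best;
      if (best > score_best) { score_best = best; i_best = i; j_best = j; }
    }
  }
}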
The likelihood functions in Algorithms 1 and 2 are derived from an error model proposed in Valouev [2006]. The functions Likelihood_local(x, y, m, n) and Likelihood_end(x, m, n) are computed as shown in Equations (3) and (4), respectively. The Likelihood_local(x, y, m, n) function consists of two terms: the bias value B_XY (provided in Equation (1)) and the maximum between the penalty value (provided in Equation (2)) and a constant P_OutlierPenalty. The values of the constants used in Equations (1) through (4) are empirically tuned to suit the optical experiment [Valouev 2006]. Changing these values does not influence the computing speed of the algorithm. Therefore, without loss of generality, in our implementations, we tuned these constants to suit our experimental input data, a synthetic human genome. These constant values are listed in Table I.
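In schematic form, writing B_XY for the bias value of Equation (1) and Pen for the penalty value of Equation (2):

    Likelihood_local(x, y, m, n) = B_XY(x, y, m, n) + max( Pen(x, y, m, n), P_OutlierPenalty )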
Let F represent the number of DNA fragments in an assembly process, and let N be the number of labels on a fragment. The algorithm requires O(F^2 N^2) Likelihood_local operations to complete an assembly process. The DNA fragment pool typically has 100,000 to 1,000,000 arrays. A typical input array length (M or N) is 15 to 100 elements. Therefore, the number of Likelihood_local operations in a human genome assembly process is on the order of 10^15. The total amount of computation requires more than 10,000 hours on a sequential CPU.
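For a rough illustration (these are not measured figures): with F = 5 × 10^5 fragments and N = 60 labels per fragment, F^2 N^2 = (5 × 10^5)^2 × 60^2 ≈ 9 × 10^14, on the order of 10^15 likelihood evaluations.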
Each element of the input arrays represents a distance, which is on the order of thousands of bases, between two neighboring optical labels on the actual DNA fragment. The synthetic data used in our implementations is designed to simulate these properties of the real-world human genomes. Since the data ranges and array lengths are similar, the computation performance tested with this synthetic data reflects the performance with the real-world genomes. Our focus is to accelerate this pairwise algorithm that aligns the optical labeled DNA molecule fragments to construct the contigs in the assembly process. The scaffolding process, using optical labeled contigs, is not a computationally intensive operation and can usually be performed on a sequential computer in 10 to 30 minutes.
ACCELERATED DESIGNS
We partitioned the algorithm by accelerating some parts on the hardware and keeping some parts on the PC. This partitioning strategy is depicted in Figure 4. The most computationally intensive stage of the algorithm is the local score computation. Therefore, in our partitioning, we accelerated the local score computation on the hardware. The two stages, boundary score computation and maximum score search, are not as computationally intensive as the local score stage. However, these two stages have significantly intensive data communication with the local score stage. In order to avoid this communication bottleneck between the PC and the hardware accelerator, we also assigned these two stages to the accelerator. The path trace stage consists of control-intensive operations. Therefore, we kept this stage on the PC. We identified three levels of possible parallelism in the algorithm (from coarse-grained to fine-grained): (1) align multiple pairs in parallel, (2) compute multiple elements (rows or columns) in the score matrix in parallel, and (3) compute the 16 likelihood scores for each score element in parallel. These three levels and their hierarchy are depicted in Figure 6. Particular computation and data reuse patterns exist in each level of possible parallelism. These patterns create tradeoffs in hardware accelerated designs.
When using level 1 parallelism (processing multiple pairs in parallel), each pair is data independent. Therefore, data communication or synchronization between parallel processes does not exist. However, it requires more computing resources to manage multiple alignments concurrently, as well as more storage resources for intermediate data (e.g., multiple score matrices). On the other hand, the other two levels of parallelism (levels 2 and 3) provide more opportunities to share or reuse the data between the parallel processes due to their finer granularity. However, these two fine-grained levels of parallelism may introduce performance challenges such as higher synchronization overhead on processors and placement-and-route complexity on FPGAs. These complex architectural tradeoffs create design-space exploration problems. We explored these design spaces to determine the proper level or combination of levels of parallelism to match the architectural features of the hardware. We also applied multiple optimization techniques to each design. We applied SIMD instructions and multithreading techniques in the CPU design. For the GPU design, we tuned the CUDA code to tackle the data dependency caused by the local score computation. We also implemented a low-level FPGA design due to the inefficient resource utilization provided by the state-of-the-art high-level synthesis tools. In the following sections, we describe the design-space explorations and the optimal designs in detail.
Multicore CPU
In the CPU design, we first improved the locality of the program by directing the compiler to store the highly reused variables in the CPU registers. We then parallelized the algorithm by inserting OpenMP directives. The performance is highly correlated with the granularity of the iterations in the algorithm. We evaluated the fine-grained strategy that processes multiple rows and columns in parallel on the multiple CPU cores. The evaluation results indicated that it is expensive to synchronize and exchange fine-grained data among the cores. The multicore CPU is more suitable for coarse-grained parallelism. Therefore, we chose to align multiple pairs in parallel on the multicore CPU.
We divided the total workload into several sets of alignment tasks and assigned each of the sets to a CPU core, as demonstrated in Figure 7. When one CPU core finishes its current alignment workload, it can start aligning another pair immediately without synchronizing with the other CPU cores. This setup does not create "dead" parallel processes or threads when the input array sizes change during the runtime. Therefore, all the CPU cores are completely occupied during this process. In addition, within each core, the process runs in a sequential fashion, which is suitable for control-dominated operations such as the boundary score computation. We also forced the functions Likelihood_local(x, y, m, n) and Likelihood_end(x, m, n) to be static and inlined in order to provide more optimization opportunities for the compiler.
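A minimal sketch of this coarse-grained scheme follows; align_pair() and the surrounding types are illustrative placeholders rather than the names in our code base, and OpenMP's dynamic schedule provides the behavior in which an idle core simply takes the next pair:

#include <omp.h>
#include <vector>

struct Fragment { std::vector<float> labels; };            // label positions
struct Pair     { const Fragment* x; const Fragment* y; };
struct Result   { float score; int i_best; int j_best; };

// Placeholder: sequential alignment of one pair (boundary scores, local
// scores, max score search), as in the baseline implementation.
Result align_pair(const Fragment& x, const Fragment& y);

void align_all(const std::vector<Pair>& pairs, std::vector<Result>& results) {
  results.resize(pairs.size());
  // Coarse-grained (level 1) parallelism: each core takes whole alignments,
  // so no synchronization is needed within an alignment.
  #pragma omp parallel for schedule(dynamic)
  for (long k = 0; k < static_cast<long>(pairs.size()); ++k) {
    results[k] = align_pair(*pairs[k].x, *pairs[k].y);
  }
}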
The computations of the Likelihood_local function give us an opportunity to utilize the CPU SSE SIMD instructions. Therefore, we programmed the Likelihood_local function to process four elements in a SIMD fashion using the "__m128" type and its intrinsic operations.
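One way to picture this 4-wide SSE usage is sketched below, assuming four candidate likelihoods and their corresponding previous scores have been gathered into contiguous arrays; our actual intrinsics code differs in detail:

#include <xmmintrin.h>  // SSE intrinsics (__m128)

// Combine four score candidates at once: prev[] holds four previous scores
// from one row of the 4 x 4 area, lkh[] the four matching local likelihoods.
// Returns the maximum of the four sums.
static inline float max4_candidates(const float* prev, const float* lkh) {
  __m128 p = _mm_loadu_ps(prev);
  __m128 l = _mm_loadu_ps(lkh);
  __m128 s = _mm_add_ps(p, l);          // four candidate scores in parallel
  // Horizontal maximum of the four lanes.
  __m128 m = _mm_max_ps(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(2, 3, 0, 1)));
  m = _mm_max_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 0, 3, 2)));
  return _mm_cvtss_f32(m);
}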
GPU
The GPU design consists of three CUDA kernels, invoked from a C++ host code. The CUDA kernels accelerate the alignment algorithm and keep the intermediate data on the GPU during the process. The C++ host program only sends the input DNA arrays to the first kernel and receives the output maximum scores from the third kernel.
There are multiple options for the CUDA kernel design based on different levels of granularity. We first evaluated the coarse-grained-only strategy on the GPU. The evaluation shows that coarse-grained parallelism is significantly bounded by a low GPU occupancy. Therefore, to fully utilize the GPU parallel computing power, we added fine-grained parallelism in our design. The GPU design computes multiple rows and columns in a fine-grained parallel fashion within each GPU thread block. The design also utilizes multiple thread blocks to align multiple pairs in a coarse-grained parallel fashion. Computing the 16 candidates in parallel is not efficient on the GPU since it requires a 16-element reduction process, which frequently creates idle threads.
We partitioned the algorithm into three CUDA kernels: (1) boundary score kernel, (2) dynamic programming kernel, and (3) maximum score search kernel. We chose this kernel partitioning because these parallelized computations require GPU global synchronization after (1) and (2).
In the boundary score kernel design, we fully parallelized the computations due to the data independency. The GPU thread arrangement is as follows: assigning the boundary score computation for each element (lines 12-30 in Algorithm 1) to one GPU thread, and assigning the boundary score computations of each alignment to one GPU thread block. With this design, we maximized the GPU parallel resource occupancy. Moreover, since this design assigns all the computations of an alignment to the same thread block, we were able to store the intermediate data in the shared memory to minimize the memory access delay in the computations.
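A structural skeleton of this thread arrangement is shown below; the kernel name, argument layout, and the elided likelihood arithmetic are placeholders rather than our exact kernel:

// One thread block per alignment, one thread per boundary score element.
// Intermediate values live in shared memory, as described above.
__global__ void boundary_score_kernel(const float* X, const float* Y,
                                      const int* offsets, float* score_mats,
                                      int max_elems) {
  const int a = blockIdx.x;          // alignment handled by this block
  const int t = threadIdx.x;         // boundary element handled by this thread
  extern __shared__ float s_tmp[];   // per-alignment intermediate data

  if (t < max_elems) {
    // ... compute the boundary score of element t for alignment a
    //     (lines 12-30 of Algorithm 1) using X, Y, and s_tmp ...
    s_tmp[t] = 0.0f;                 // placeholder result
  }
  __syncthreads();
  if (t < max_elems) {
    score_mats[offsets[a] + t] = s_tmp[t];   // write into the score matrix
  }
}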
The pseudo code of the dynamic programming kernel is described in Listing 1. We parallelized the score element computations using N × 4 threads in each thread block. The candidate score computation for each matrix column requires the four previous columns, as described in line 7 of Algorithm 2. Parallelizing this part of the algorithm is a challenging task due to this data dependency. We overcame this issue by dynamically assigning the columns of the score matrix to four groups of threads. As described in Listing 1, we used threadIdx.y to partition the N × 4 threads into four groups. They form a software pipeline. Each thread group is only responsible for a specific candidate computation (leftmost, second left, second right, or rightmost). By increasing col_id, we stream the columns of the score matrix into this pipeline.
The example in Figure 8 shows a snapshot of this software pipeline when the GPU is processing columns 8, 9, 10, and 11. These columns are assigned to the different stages (thread groups) of the pipeline: column 11 to group threadIdx.y = 0, column 10 to threadIdx.y = 1, column 9 to threadIdx.y = 2, and column 8 to threadIdx.y = 3. The computations in the pipeline stages threadIdx.y = 0 through 3 are the leftmost candidates, second-left candidates, second-right candidates, and rightmost candidates, respectively, as shown in the shaded blocks in Figure 8. In the snapshot, these computations all require the data from column 7, which has already been computed in the previous col_id iteration (refer to the "for" loop in Listing 1). Once the computations in the snapshot are finished, the data in column 8 is then ready. With the data from column 8, the pipeline streams in a new column (column 12 in the snapshot) by increasing the iteration index col_id. These four thread groups execute different instructions to implement the four stages of the pipeline. In order to fit this design on the GPU SIMD architecture, we ensured that the threads of each GPU warp execute the same instruction by extending N to a multiple of 32.
Fig. 8. Visualized GPU kernel for dynamic programming, assuming the score matrix size is M × N. N rows × 4 columns of score elements are computed concurrently. For example, columns 8, 9, 10, and 11 are computed concurrently. At the given state in the example, column 11 is assigned to the threads whose threadIdx.y = 0. Then the leftmost candidates of column 11 are computed using the previously computed data in column 7. Similarly, columns 10, 9, and 8 are assigned to threadIdx.y = 1, threadIdx.y = 2, and threadIdx.y = 3, respectively. N is set to a multiple of 32 to ensure each warp has the threads with the same threadIdx.y.
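A simplified CUDA skeleton of this software pipeline is given below, with the candidate arithmetic elided and all names illustrative; see Listing 1 for the actual pseudo code:

// One thread block per alignment; blockDim = (N, 4).
// threadIdx.x selects the score-matrix row, threadIdx.y the pipeline stage
// (0 = leftmost candidate, 1 = second left, 2 = second right, 3 = rightmost).
__global__ void dp_kernel_skeleton(const float* X, const float* Y,
                                   float* score, int M, int N) {
  const int row   = threadIdx.x;
  const int group = threadIdx.y;

  // Stream the columns of the score matrix through the pipeline; the extra
  // iterations at the end drain the last columns out of the pipeline.
  for (int col_id = 0; col_id < N + 3; ++col_id) {
    const int col = col_id - group;     // column owned by this stage now
    if (col >= 4 && col < N) {
      // Every in-flight column reads only from column col_id - 4, which was
      // completed in an earlier iteration, so there is no data hazard.
      // ... candidate = score[row-ish indices][col_id - 4] + likelihood(...);
      // ... merge 'candidate' into the running maximum kept for (row, col) ...
    }
    __syncthreads();                    // all four stages advance together
  }
}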
Once the dynamic programming kernel finishes computing the score matrix, the third kernel searches the matrix to find score_best, i_best, and j_best. We implemented this maximum score search kernel using the reduction approach. We kept the reduction process of each alignment within one thread block. Therefore, this process does not require expensive global synchronization on the GPU. Then, we created multiple thread blocks to concurrently process the reductions for multiple alignments. We also applied shared memory and an efficient warp arrangement in the reduction.
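A self-contained sketch of such a per-block maximum reduction is shown below; it is simplified to return a flat index instead of (i_best, j_best), and the kernel name and argument layout are illustrative:

// Launch example (per-block shared memory holds values and indices):
//   max_score_kernel<<<numAlignments, 256,
//                      256 * (sizeof(float) + sizeof(int))>>>(d_scores, M * N,
//                                                             d_best, d_idx);
__global__ void max_score_kernel(const float* score_matrices, int matSize,
                                 float* best_out, int* idx_out) {
  extern __shared__ float s_val[];
  int* s_idx = (int*)&s_val[blockDim.x];

  const float* mat = score_matrices + (size_t)blockIdx.x * matSize;
  float v = -1e30f;
  int   p = 0;
  for (int k = threadIdx.x; k < matSize; k += blockDim.x) {
    if (mat[k] > v) { v = mat[k]; p = k; }
  }
  s_val[threadIdx.x] = v;
  s_idx[threadIdx.x] = p;
  __syncthreads();

  // Tree reduction within the block (blockDim.x must be a power of two).
  for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride &&
        s_val[threadIdx.x + stride] > s_val[threadIdx.x]) {
      s_val[threadIdx.x] = s_val[threadIdx.x + stride];
      s_idx[threadIdx.x] = s_idx[threadIdx.x + stride];
    }
    __syncthreads();
  }
  if (threadIdx.x == 0) {
    best_out[blockIdx.x] = s_val[0];   // score_best for this alignment
    idx_out[blockIdx.x]  = s_idx[0];   // flat index; i = idx / N, j = idx % N
  }
}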
FPGA
Similar to the GPU, the FPGA also accelerates the algorithm by processing the computations in a parallel fashion. The FPGA is a customizable architecture. There are usually two ways to implement parallelism on the FPGA: (1) replicate a logic module multiple times to physically create multiple parallel data paths and (2) pipeline the architecture to process multiple data concurrently in a streaming fashion. A high-performance design requires a proper decision on which technique is used to implement each of the three levels of the algorithmic parallelism. Moreover, due to the FPGA resource constraints, a feasible design also requires the proper number of replications in each level of parallelism. There exist many possible settings of choices of parallel techniques and numbers of replications. In order to reduce the size of the design space, we first constructed a reasonable structure of the FPGA design based on heuristics. Figure 9 depicts this FPGA structure. Due to the algorithmic data dependency, it is impossible to replicate parallel data paths for both the row and column dimensions. Therefore, we chose to only replicate the row parallel data paths (level 2 parallelism) and pipeline the column dimension (also level 2 parallelism). We replicated the likelihood score units (level 3 parallelism) to sustain the throughput of this full pipeline in the column dimension. We also replicated the entire alignment module multiple times (level 1 parallelism) to maximize the overall throughput. We then permuted the numbers of parallel paths in this structure to find the optimal setting.
Fig. 9. FPGA implementation of the three levels of parallelism: (1) replicate the overall architecture to process multiple alignments in parallel; (2) within each alignment score matrix, replicate the row modules to process multiple rows in parallel and pipeline multiple columns; (3) for each score element, replicate the likelihood score module to compute multiple likelihood scores in parallel.
Implementing multiple RTL designs to measure the performances of these permutations requires a significant amount of effort. Therefore, exploring the FPGA design space using RTL designs is inefficient in terms of the development complexity. Instead of manually implementing multiple RTL designs, we propose a method using Vivado HLS, which enables rapid FPGA implementations to explore different parallel structures.
We restructured the original software C code, as described in Listing 2, into a format that represents the parallel hardware structure. We first constructed a function lkh_score() to implement the likelihood score computation in Equation (3). To implement the full pipeline, we restructured the local score computation code into a function pipeline_unit() with four pipeline stages. Lines 35 through 39 of Listing 2 describe a line buffer used to feed the input into this pipeline. We then call pipeline_unit() in the alignment_module() function. We replicated multiple instances of lkh_score(), pipeline_unit(), and alignment_module() to generate parallel data paths at the three levels. Finally, we used the function top_module() to wrap up these submodules. This new C-code structure eases the scheduling task for the HLS tool to generate efficient architectures.
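The following C++ skeleton sketches the structure of this restructured code; function bodies are elided, array sizes and the per-function replication limits are illustrative placeholders, and the pragmas use the Vivado HLS ALLOCATION/PIPELINE syntax (see Listing 2 for the actual code):

constexpr int MAX_M = 128;   // placeholder maximum array sizes
constexpr int MAX_N = 128;

// Likelihood score computation of Equation (3); called inside pipeline_unit()
// in the full code (body elided here).
float lkh_score(float x, float y, int m, int n);

// One row pipeline: four stages computing local scores for a row, fed by a
// line buffer of previously computed columns (body elided).
void pipeline_unit(const float X[MAX_N], const float Y[MAX_M],
                   float score_row[MAX_N]) {
#pragma HLS PIPELINE II=1
  // ...
}

// Full alignment of one pair: replicate row pipelines (level 2) and
// likelihood units (level 3); the limits below are illustrative only.
void alignment_module(const float X[MAX_N], const float Y[MAX_M],
                      float& best, int& i_best, int& j_best) {
#pragma HLS ALLOCATION instances=pipeline_unit limit=4 function
#pragma HLS ALLOCATION instances=lkh_score limit=16 function
  float row[MAX_N];
  pipeline_unit(X, Y, row);              // invoked per group of rows in the full code
  best = row[0]; i_best = 0; j_best = 0; // placeholder reduction of the row scores
}

// Top level: replicate whole-alignment modules (level 1).
void top_module(const float X[MAX_N], const float Y[MAX_M],
                float& best, int& i_best, int& j_best) {
#pragma HLS ALLOCATION instances=alignment_module limit=2 function
  alignment_module(X, Y, best, i_best, j_best);
}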
We permuted the numbers of the three levels of data path replications in the restructured C code by modifying the limit parameter in the HLS ALLOCATION directives. We evaluated 10 different settings by running the entire HLS design tool chain, including the placement-and-route phase. Figure 10 depicts the evaluations of these HLS designs. The experimental results indicate that design H achieves the highest throughput efficiency among all the evaluated designs. We then implemented design H in RTL to further improve the resource efficiency.
Our RTL FPGA design consists of two modules: (1) boundary score module and (2) dynamic programming and maximum score module. To achieve a high throughput, we fully pipelined the FPGA architecture to output a new likelihood score every clock cycle. The two modules are able to run concurrently in a streaming fashion.
The architecture of the boundary score module is described in Figure 11. In this figure, we demonstrate the boundary score module by showing only an example computing the scores of the fourth row from the top. The other rows and columns are identical to this example. We replicated this architecture 16 times to process the top four rows, bottom four rows, left four columns, and right four columns of boundary scores in parallel. This boundary score module is fully pipelined and consists of control logic (the black blocks in the figure), arithmetic units (the gray blocks), multiplexers, and a shift register for X. The control logic and arithmetic units correspond to Algorithm 1. The shift register is used for accessing X_{Lx_{i,j}} and X_k, as shown in lines 17 and 28.
The design of the dynamic programming module is described in Figure 12. This architecture consists of five major pipeline stages, as shown in the figure, and all stages are fully pipelined. Stages 1 through 4 compute the maximums of the leftmost, second-left, second-right, and rightmost columns of candidates, respectively.
We replicated the described architecture 5 times to process five rows of scores in parallel. After the last column of the current five rows, the next five rows enter this architecture to continuously fill the pipeline. The outputs of all the rows are passed to a pipelined maximum module to find score_best, i_best, and j_best. We chose to process five rows in parallel to match the throughput of the boundary score module. The two modules are thus able to run in a streaming fashion without idling. We used fixed-point numbers and arithmetic in the FPGA design. Due to the data range, we used 26 bits for the scores and 18 bits for the input arrays, both with 10 fractional bits. These fixed-point numbers can represent optical labeled fragments of up to 1,000,000 bases. This range covers most human genome assembly applications. In the score functions, we implemented the divisions using lookup tables.
Since the RTL implementation is more resource efficient compared with the HLS implementation, we were able to replicate the overall alignment architecture 5 times to align five pairs of DNA molecules concurrently. In the HLS design, we were only able to replicate this architecture 2 times. We enhanced the resource efficiency by more than 200% using the RTL design.
RESULTS
The input data for the alignment algorithm in our experiment consists of 16,642 DNA molecule fragments. Each fragment contains 5 to 182 labels. The range of the distances between labels is 500 to 3.79 × 10^5 bases. We tested the multicore CPU design on a 3.1GHz Intel Xeon E5 CPU with eight cores. The CPU design was compiled with O3 GCC optimizations. The GPU design was tested on an Nvidia Tesla K20 Kepler card. The FPGA design was implemented and tested on a Xilinx VC707 FPGA development board. The input data used in our experiments is a set of synthetic human genome sequences. Figure 13 presents the performance of our implementations. The baseline is a highly optimized C++ program without any parallelism. The average time for aligning two optical labeled molecules is 42.486μs in the baseline implementation. The runtimes for the boundary score, dynamic programming, and maximum score operations are 9.351μs, 30.594μs, and 2.541μs, respectively.
The OpenMP parallelized C++ program takes 5.04μs to align a pair of molecules on an eight-core CPU with hyperthreading on each core. This multicore implementation achieves an 8.4× speedup, which is roughly proportional to the number of cores; the extra 0.4× is contributed by hyperthreading. The CPU SSE SIMD technique further boosts the performance to an 11× speedup against the baseline.
Our GPU implementation was written using the Nvidia CUDA 6.5 SDK. The GPU runs at a base frequency of 706MHz and has 2,496 CUDA cores. We varied the number of input alignments per host-device data transfer transaction from 10 to 10,240 to investigate how the GPU design performs. As shown in Figure 14(b), the dynamic programming kernel converges to its minimal runtime after increasing the number of alignments per transaction to 2,560. The boundary kernel and the max reduction kernel converge to their minimums when the number of alignments per transaction reaches 640. The device-to-host data copying operation keeps speeding up as the number of alignments increases. This is due to the fact that the output array contains very little data (each alignment only generates one max score and two indices of the X and Y arrays), which never saturates the memory transaction bandwidth. However, the dynamic programming kernel dominates the overall runtime. As a consequence, the overall performance reaches its maximum throughput after increasing the number of alignments to 2,560.
The best performance of the GPU design is at 3.116μs per alignment with a 13.6× speedup against the baseline. The runtimes for the boundary score, dynamic programming, and maximum score kernels are 0.940μs, 1.484μs, and 0.494μs, respectively. The data transferring time between the host memory and GPU memory is 0.198μs.
The FPGA design was built using Xilinx ISE 14.7 in Verilog. The FPGA design was implemented on a Xilinx Virtex 7 VC707 board, receiving input and sending output using RIFFA [Jacobsen and Kastner 2013] (configured as an x8 Gen 2 PCIe connection to the PC). We also investigated how the FPGA design performs when changing the number of alignments per RIFFA transaction between the host CPU and the FPGA, again varying the number of alignments per transaction from 10 to 10,240. The simulated design gives the ideal throughput of the FPGA acceleration module. We measured the bandwidth of RIFFA by sending the data to the FPGA and receiving the same data back to the host without doing any computation on the FPGA. We only measured the host-to-device bandwidth since the transfer from the device to host contains very little data. As shown in Figure 14(a), the actual FPGA performance is significantly lower than the simulated ideal performance due to the RIFFA bandwidth limit when the number of alignments is between 10 and 640. After the number of alignments reaches 5,120, the actual FPGA performance converges to its maximum, which is slightly lower than the ideal performance due to the host software overhead.
Our FPGA experimental result shows the best throughput at 2.7 million pairs of molecules per second, or equivalently 0.367μs per alignment. Thus, the FPGA implementation achieves a 115× speedup against the baseline. Our FPGA implementation runs at a frequency of 125MHz. Table II lists the resource utilization of the entire design, including the PCIe communication logic. The design occupies 89% of the slices on the FPGA. In order to meet the timing constraint with such a high logic utilization, we used "SmartXplorer" to explore multiple placement-and-route strategies in ISE.
Since we used a fixed-point number representation in the FPGA design, compared to the baseline floating-point design, we observed a 0.019% error, which is negligible in real applications.
HARDWARE COMPARISON
Table III presents a summary of the comparison among the three hardware platforms. We compare the performance and price of the hardware accelerators.

Performance
Although the multicore CPU has the highest operating frequency among the three platforms, it achieves the lowest speedup. This is due to the fact that the multicore CPU has very limited parallel computing resources: eight cores with hyperthreading. These cores are not closely coupled in the architecture. Frequently synchronizing these cores for fine-grained parallel computations becomes prohibitively expensive. Therefore, we were only able to utilize these cores to align multiple molecule pairs in a coarse-grained parallel fashion. The SSE SIMD extension provides a limited level of fine-grained parallelism; the CPU's 128-bit SIMD extension does not provide dedicated SIMD units to achieve massive fine-grained parallelism.
The GPU, as opposed to the multicore CPU, has an SIMD architecture that supports fine-grained parallelism. We therefore observed a higher speedup on the GPU. However, the control-dominated boundary score computations introduce a significant amount of divergent instructions, which harms the parallelism of the SIMD architecture. The GPU accelerates the dynamic programming algorithm by 20×, while it only accelerates the boundary score algorithm by 10×. Compared with the GPU, each coarse-grained parallel CPU core processes each alignment in a sequential fashion, which has more advantage in dealing with the control-dominated instructions. Moreover, the array size differences create multiple inactive threads. With the CUDA profiler "nvvp," we observed that these inactive threads occupy more than 45% of the GPU computing resources due to the control branch divergence. Compared with the GPU design, the multicore CPU and the FPGA suit this feature of the algorithm better. The multicore CPU coarse-grained parallelism avoids inactive threads. Unlike the GPU threads, which are issued before the program starts and cannot be changed during the runtime, the FPGA pipeline terminates and moves on to the next array as soon as the current array finishes. These unsuitable GPU features limit its performance for the alignment algorithm.
On the FPGA architecture, the customized logic avoids the divergent instruction issue in the boundary score algorithm. In the dynamic programming module, the pipeline on the FPGA is spatial: the data is transferred from one logic stage to the next using on-chip registers. In contrast, in the GPU design, we implemented a similar optimization using warps (different threadIdx.y values), as shown in Listing 1 and Figure 8. Each GPU warp corresponds to a logic module on the FPGA. Unlike the spatial pipeline, the GPU warps are scheduled temporally. The data is not transferred spatially between warps; instead, the warps read and write the data in shared memory. Although these warps are designed to be processed efficiently on the GPU, the FPGA spatial pipeline still outperforms the GPU warps because it avoids the overhead from scheduling and memory access. Moreover, the boundary score module stores its output in the low-latency BRAM on the FPGA. The dynamic programming module can then access these boundary scores within one clock cycle. As opposed to the FPGA, the GPU dynamic programming kernel reads the boundary scores from the global memory with a higher latency. For these reasons, the FPGA implementation achieves the highest performance.
Hardware Prices
The prices of the acceleration hardware vary significantly depending on the complexities of the devices. The devices used in our implementations belong to the high-end category. The Xilinx VC707 FPGA evaluation board costs about $3,495 [Xilinx 2014]. The Nvidia K20 GPU can be purchased for $3,200. These high-end GPUs and FPGAs have comparable prices. The high-end CPU, the Intel Xeon E5, has a relatively lower price of roughly $2,000. In our comparison, we added the CPU price to the cost of the GPU and FPGA accelerated systems since both of them use the CPU as a host to send and receive data. In our application, the performances per dollar are 99.2 aligns/sec/$, 61.7 aligns/sec/$, and 495.5 aligns/sec/$ for the multicore CPU, GPU, and FPGA, respectively.
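These figures follow directly from the measured throughputs and the prices above. For example, the multicore CPU aligns a pair every 5.04μs, or about 198,000 alignments per second; dividing by its roughly $2,000 price gives 99.2 aligns/sec/$. Similarly, the GPU system's 1/3.116μs ≈ 321,000 alignments per second over a $3,200 + $2,000 = $5,200 system cost gives 61.7 aligns/sec/$.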
CONCLUSION
In this article, we have addressed the necessity of accelerating the optical label-based DNA assembly. We have presented three different accelerated approaches: a multicore CPU implementation, a GPU implementation, and an FPGA implementation. We have also presented the detailed design-space explorations for these three approaches. The speedups over the sequential CPU baseline are 8.4×, 13.6×, and 115× for the multicore CPU, GPU, and FPGA, respectively. Using spatial pipelines, the FPGA design has been customized to suit the algorithm more efficiently than the other two platforms. The tradeoff for this performance efficiency on the FPGA is its significantly higher design complexity in comparison with the approaches on the other two platforms.
