1 Bit-parallel algorithms
General Integer Scoring
BitPAl [1] employs a scoring scheme based on the three integers for match (M ), mismatch (I), and gap (G) with the conditions M ≥ 0, I < 0, G < 0, and I ≥ 2G.
Consider two strings Q and T of length m and n, respectively. The similarity scoring matrix S is given by the following recurrence:
Initialization is given by S[i, 0] = i × G, S[0, j] = j × G for 0 i m, 0 j n. Bit-parallelism is achieved by computing and storing the difference (∆) values for rows or columns instead of absolute values of S. Let ∆V and ∆H be the vertically and horizontally adjacent differences:
The recurrence for ∆V can be rewritten as in Equation 3 . Each ∆-matrix can be encoded in a bit-vector fashion using M − 2G + 1 words. Each word records a specific value and each bit in the word records if the value appears in the respective cell using a one-hot encoding scheme. The algorithm proceeds in a row-by-row fashion through the two ∆-matrices. Different (∆V, ∆H) pairs thereby partition the function table into distinct zones with the boundaries of ∆V [i, j − 1] = I − G, ∆H[i − 1, j] = I − G, and ∆V [i, j − 1] = ∆H[i − 1, j]. As the bit-vectors record distinct values, the algorithm updates values zone-by-zone with a sequence of bitwise operations. Hence the complexity of the algorithm depends on the size of the function table, which is in turn determined by the scoring scheme; i.e. the complexity grows with larger values of G and M − G. To mitigate this behaviour BitPAl can also be implemented using a packed encoding which uses the bit vectors as binary representations for ∆H or ∆V values (instead of a one-hot encoding). Even though the packed method involves more operations, it tends to perform better than one-hot encoding for scoring schemes with large values due to the more compact encoding.
Unit-Cost Scoring
Myers' algorithm [2] can be implemented as a special of BitPal using the scoring scheme M = 0, I = −1, and G = −1 -denoted as BitPAl(0, −1, −1). The corresponding implementation using three bit-vectors per ∆-matrix is shown in Algorithm 1. In comparison the direct implementation of Myers' algorithm is usually faster than BitPAl(0, −1, −1) since it requires less bit operations (see Algorithm 2) . 
Algorithm 2 Direct implementation of Myers' algorithm
2 Bit-parallel Vectorization
We now present our word-based vectorization method for bit-parallel algorithms on SIMD vector units. Vector processing units (VPUs) divide long vector registers into a number of lanes to handle multiple data items in parallel using a single instruction. There are two sources of parallelism in our application scenario:
1. inter-sequence parallelism, and 2. intra-sequence parallelism.
The former assumes that a whole batch of sequences needs to be aligned (e.g. a query sequence is compared to a set of subject sequences). Thus, this approach can utilize data-level parallelism among different alignments; i.e. each vector lane computes a different (query-subject) alignment in parallel. The latter exploits the parallelism within a single pairwise alignment, i.e. different lanes compute different cells of a single DP matrix in parallel. For bit-parallel alignment algorithms, however, the intra-sequence parallelization approach may lead to a low utilization rate of vector registers. As already mentioned in Section 1, bit-parallel algorithms 
32-bit integer
32-bit integer 32-bit integer Figure S1 : Example of packing 16 DNA sequences into a 512-bit register. This example simply assumes that the length of DNA sequences is less than 32.
use each bit to code a single cell in the DP matrix. Thus, the (query) sequence length needs to be a multiple of the vector length in order to achieve full utilization. However, this is rarely the case in practice. For example, for a Xeon Phi with a vector length of 512 bits and NGS reads of length 100 bps, the corresponding utilization rate would only be around 20%. Therefore, we base our solution on the inter-sequence parallelization approach instead. Our target platforms include the SSE, AVX2, KNC, and AVX-512 instruction sets for multi-core CPU platforms and many-core Xeon Phi platforms, respectively. These instruction sets differ in vector lengths (ranging from 128-bit to 512-bit) and number of lanes. Our method should be able to adapt to these instruction sets in a flexible way by only changing the parameters L (vector lane length) and N (number of lanes). Figure S1 shows an example for L = 32 and N = 16.
An important consideration is the processing of carry bits. The ADD operation (required in Algorithms 1 and 2) can generate a carry bit that needs to be processed by the neighboring lane. The SSE, AVX2, and AVX512 instruction sets, however, discard the carry bit if we use the whole word as a bit vector. Thus, we leave the most significant bit (MSB) as a buffer bit in order to keep the carry bit. We then need to employ additional bit operations to get the carry bit between every two iterations. In contrast, the KNC instruction set provides the special instruction ADC that can return the carry bit while adding. This simplifies our implementation. Both carry addition approaches are illustrated in Figure S2 . In order to keep consistency, we change the ADD meta operation to ADC, and implement extension classes with L − 1 bits in each bit-vector and respective instructions for SSE, AVX2, and AVX-512, and L bits and keep the ADC instruction for KNC.
Modularized Parallel Framework
Our work targets to achieve device-level extensibility with native code instead of migrating to the unified OpenCL framework in order to use optimized kernel functions and to gain the best possible performance. We abstract computing devices using two types of modules: the global module and the local module. The global module accesses the sequence database and dispatches subject sequences to devices. The local module handles internal computation and buffer details. In particular, this includes passing the ... sequence buffer to compute functions and performing necessary memory transfers for PCIe-based manycore co-processors. To extend to a new type of processing unit (PU), we only need to re-implement the local module while the global module can be reused without changes. This approach thus allows for independent optimization of both the architecture-specific and the architecture-independent part of the application.
The Global Module
The global module in the BGSA framework loads database sequences and dispatches them to available compute devices. We first split the database into several equal-sized global data chunks. The global data chunks are further partitioned into smaller local data chunks for the respective devices, as is shown in Figure S3 . We use two global buffers in order to enable concurrent computation and buffer filling facilitating their overlapped execution.
In order to use the inter-task vectorization method described in Section 2, we divide each data chunk into sequence batches. Each sequence batch consists of Ω equal-length sequences. We further compute matching profiles for each subject sequence of a batch. As both considered bit-parallel algorithms require the matching information of the query sequence against the subject sequence batch, we build a matching profile by comparing each subject sequence with every single symbol in the alphabet. Figure S1 shows an example on a 512-bit VPU with the DNA alphabet Σ DNA = {A,G,C,T}. The matching profile enables a bit-parallel algorithm to use a single vector load instead of multiple instructions to compare the query symbol against a subject sequence. This reduces the core instruction count significantly. Using bit-vector encoding, the matching information only occupies a single bit for each database sequence symbol. Thus, 
Global IO Module

Encoding Encoding
Encoding Encoding Figure S3 : Our device-level data distribution framework with dynamic regulation.
the total memory occupation of the matching profile can be calculated as follows:
For BGSA we only process genomic data over Σ DNA with |Σ DNA | = 4. In this case our matching profile can compress the size of the sequence batches by half. Global data chunks are dispatched in local data chunks. We employ a dynamically regulated distribution strategy to balance the workload among different devices with minimal overhead. This strategy partitions each incoming data chunk according to a dynamically maintained splitting ratio table. The splitting ratio is adjusted with respect to the processing time of each data chunk. We use a set of weights to indicate the relative computing speed of each device in order to determine the splitting ratio. Let W i,j be the weight for device i processing the j th data chunk, and T i,j be the respective computing time. The update method is as follows:
The weight W i,j evaluates the relative speed of device i compared with device 1, and the corresponding splitting ratios R i,j are calculated with the following function:
However, this iteration is highly sensitive to recent performance fluctuations, which can lead to unstable performance. We address this issue by introducing a stabilizer S j , and updating the weight table with the following iteration:
where the S j series should be monotonously increasing. The new equation evaluates the weights with respect to their past performance whereby the latest record is most significant. Thus, our dispatching scheme is less sensitive to small fluctuations, while it is still able to respond to major performance changes. The degree of sensitivity is determined by the convexity of S j . A convex curve is more stable while a concave curve is more sensitive. The default setting of S j in BGSA is S j = j, which works well in practice.
The Local Module
The local module transfers data to device memory and performs actual computation. The compute kernels are also included in the local modules. For a general-purpose host system, data remains at the original memory position so that there is no transfer overhead. For PCIe-based co-processors, we implement the transfer method using a dual device buffer strategy to hide transfer latency. We provide a code generator written in Java to quickly generate local modules for a wide variety of devices as well as different scoring schemes. The code generator decouples the algorithm description from specific vector instruction sets by utilizing a class hierarchy. The UML diagram is shown in Figure  S4 . In this class hierarchy, the SIMDArch class defines several SIMD-based meta operations that will be used in the algorithms, in our case ADD, AND, OR, XOR, and NOT. These meta operations comprise a virtual instruction set which enables a device-independent description of an algorithm. For each instruction type, we implement an extension of the SIMDArch class, so that a single copy of the algorithm description can fit on a number of target architectures. The AlgorithmGenerator class describes the algorithm with the meta operations. Different extensions of this class generate kernel codes for different algorithms. We highlight that the value of the gap parameter affects the generation of kernel codes in the BitPAl algorithm; e.g. for the (0, −1, −1) scoring scheme we use the Myers' algorithm instead since it requires less core instructions. This approach allows for easy extensibility to other architectures and the addition of more bit-parallel alignment algorithms without being concerned about instruction set details.
Performance Evaluation
We evaluate the performance of BGSA on several multi-core and many-core platforms. Detailed hardware specifications are listed in Table 1 . The processors are installed in two servers. Server S1 consists of a dualsocket Xeon E5-2620v2 CPU and two KNC-based Xeon Phi-7110p co-processors connected via the PCIe bus running CentOS 6.5. The memory capacity is 16GB. Server S2 is the Intel KNL development platform equipped with a self-bootable Xeon Phi 7210 many-core processor, with 16GB on-package MCDRAM and 96GB DDR4 memory. MCDRAM is configured as cache mode and the cluster mode as quadrant mode. The server runs CentOS 7.2. Server S3 is equipped with a W-2123 CPU and 16GB DDR4 memory, and this server runs Ubuntu 16.04. All programs are compiled with icc v15.0.0 at O3 optimization level. For all experiments, we use reads generated by the read simulation tool Mason [3] from a human reference genome. The generated sequences range from 100 to 960 bps in length. The query set includes 1K sequences while the subject database is composed of 5M sequences. In order to make global sequence alignment meaningful, we only compare query sequences to subject sequences of the same length. All performance figures are averaged over 10 runs, and measured in terms of GCUPS (Billion Cell Updates Per Second ). Both the time for device/host data transfer and for computation are included in the GCUPS calculations for all target applications in our tests. However, the time for file I/O and format conversions are not included. Compared to other applications in our experiments, BGSA can do both the format conversion and file I/O more efficiently. In order to measure performance that demonstrate actual algorithmic efficiency rather than disk speed, we have decided not to include file I/O and format conversion timings into the GCUPS measurement.
The following are the details of the configuration of each tool with the latest version. Unless otherwise stated, other options of the mappers were configured as their default settings.
• Edlib: edlib-aligner -k threshold -s subject.fasta query.fasta.
• SeqAn: align bench par query.fasta subject.fasta -a dna -v -i 16 -alignment-mode search -t thread num.
• BitPAl: bitpal -q query.fasta -d subject.fasta.
• KSW2: ksw2-test -t algorithm -A match score -B mismatch score -O gap score query subject.
• Parasail: parasail aligner -d -x -e gap score -o gap score -X mismatch score -M match score -f query.fasta -q subject.fasta -t thread num -a nw striped ( avx2 256|sse 128) 16.
Single-Threaded Performance
We first compare single-threaded performance of different kernels on the target computing platforms using various SIMD extensions (SSE, AVX2, KNC, AVX-512). We use two different scoring schemes: (0, −1, −1) which is equivalent to Myers' algorithm, and the more general scoring scheme (2, −3, −5) (which was also evaluated in the BitPAl paper [1] ). For the (0, −1, −1) scheme we also compare BGSA to the Myers' algorithm implementation provided by the SeqAn package [4] . For the (2, −3, −5) scheme we compare BGSA to the original BitPAl implementation. We also compare performance to the SIMDvectorized NW implementation of the recently introduced Parasail library [5] . Figure S5 shows the results for the (0, −1, −1) and the (2, −3, −5) scoring scheme. BGSA outperforms the original BitPAl implementations by a factor of 2.3 on average on the SSE-based E5-2620 CPU. Furthermore, BGSA achieves an average speedup of 4.2 and 5.1 on a Xeon Phi-7210 processor and a W-2123 processor with AVX2 extension. With the 512-bit KNC and AVX512 extensions, BGSA outperforms BitPAl by a factor of 7.4, 10.1 and 8.1 on a Xeon Phi-7110 co-processor, a Xeon Phi-7210 and a W-2123 processor, respectively. The achieved speedups are actually better than the multiply factor of data parallelism due to our efficient framework design and more efficient instruction flow. We also note that the (0, −1, −1) kernels are significantly faster than the (2, −3, −5) kernels. This is due to the more complex core instructions to update a matrix cell.
For (2, −3, −5) scoring scheme, BGSA achieves the approximate equivalence performance on E5-2620 CPU with the SSE extension compared to Parasail SSE version. On a W-2123 processor, BGSA achieves On E5-2620 CPU, with SSE intrinsics, BGSA outperforms SeqAn NW by a factor 6.0 for (0, −1, −1) scoring scheme, and 22% slower than SeqAn NW for (2, −3, −5) scoring scheme. And with AVX-512 intrinsics, compared to SeqAn NW, BGSA achieves an average speedup of 10.6 and 24.6 on W-2123 processor and Phi-7210 processor for (0, −1, −1) scoring scheme, and an average speedup of 1.3 and 3.7 on W-2123 processor and Phi-7210 processor for (2, −3, −5) scoring scheme.
Compared to the implementation of Myers' algorithm in SeqAn, BGSA achieves an average speedup of 6.9 on the E5-2620 CPU with SSE extension. And with the AVX512 extension, BGSA outperforms SeqAn by a factor 13.6 and 22.1 on average on a W-2123 processor and a Phi-7210 processor. In addition, BGSA outperforms Edlib by a factor 5.3 on the E5-2560 CPU. With the AVX512 extension, BGSA achieves a speed up of 16.1 and 18.9 on W-2123 processor and Xeon Phi-7210 processor.
Thread Scalability
We measure the thread scalability of BGSA on our different target platforms. We use a single sample query data set versus the same database under different thread configurations. For the dual-socket E5-2620 CPU with 12 cores in total, we measure performance from 1 to 24 threads. For the W-2123 CPU with 4 cores, we measure performance form 1 to 8 threads. For a single Xeon Phi-7110 co-processor with 61 cores, we use 1 to 240 threads reserving a single core to handle system calls. We launch 1 to 256 threads on the KNL-based 64-core Xeon Phi-7210 processor. As the original BitPAl doesn't support multi-threading, we only compare BGSA to the SSE-based Parasail exectuted on the E5-2620 and AVX2-based Parasail executed on the W-2123 and the Xeon Phi-7210 processors.
The results are shown in Figure S6 . On SSE-based CPUs, BGSA shows good scalability from 1 to 12 threads. Performance only increases slightly between 12 to 24 threads, which means that hyperthreading can further improve performance on CPUs. And BGSA shows the similar trending on W-2123 processors. On KNC-based Xeon Phi co-processors, performance increases steadily from 1 to 240 threads. This means that instruction issue is a bottleneck on a KNC and hyper-threading can improve performance. On KNL-based Xeon Phi processors, performance hardly beyond 64 threads for (0, −1, −1) and increases marginally for (2, −3, −5). In general, hyper-threading benefits from more complex instruction flows. Parasail, in contrast, shows poor scalability for hyper-threading on both CPUs and KNL.
Overall Performance
We measure the overall performance on our various target platforms for different sequence lengths. In addition of evaluating it on homogeneous multi-core and many-core platforms, we also test the performance on a heterogeneous platform with a dual E5-2620 CPU and two Xeon Phi-7110 cards. Parasail is evaluated on a E5-2620 CPU with SSE intrinsics as well as on a W-2123 and a Phi-7210 with AVX2 intrinsics. SeqAn NW is evaluated on a E5-2620 CPU with SSE intrinsics as well as on a W-2123 and [6] since it has the best performance in our experiments and use 16-bit integers for vectorization. Figure S7 and S8 show the eveluation results. For the (0, −1, −1) scheme, BGSA outperforms Parasail by an average factor of 8.3 on an E5-2620 with SSE intrinsics, and by an average factor of 10.1 and 15.4 on a W-2123 and a Phi-7210 with AVX2 intrinsics. BGSA achieves an even higher speedup of 18.2, and 32.6 on a Phi-7110 using KNC intrisics and on a Phi-7210 using AVX-512 intrinsics, respectively, compared to Parasail AVX2 on a Phi-7210. And BGSA outperforms a speedup of 14.8 with AVX-512 intrinsics compared to Parasail AVX2 on a W-2123. BGSA outperforms SeqAn NW by an average factor of 6.8 on E5-2620 with SSE intrinsics, and with AVX512 intrinsics BGSA achieves an average speedup of 11.1 and 19.6 on a W-2123 and Phi-7210. We don't test BitPAl, SeqAn NW and Edlib here since none of them supports multi-threading and both of them are much slower than BGSA in single thread mode shown as Section 4.1.
For the (2, −3, −5) scheme, BGSA achieves similar performance compared to Parasail and SeqAn SSE on E5-2620 with SSE intrinsics. On a Phi-7210, BGSA outperforms Parasail by a factor of 2.4 on average when using AVX2 intrinsics. Using AVX-512 intrinsics on a Phi-7210, BGSA achieves an average speedup to 6.3. On a W-2123, BGSA achieves a speedup of 1.5 and 2.0 compared to Parasail AVX2 with AVX2 intrinsics and AVX-512 intrinsics. Compared to SeqAn NW, BGSA achieves a speedup of 1.5 and 3.7 on a W-2123 and a Phi-7210 with AVX-512 intrinsics. In addition, an efficiency of 98% can be achieved when BGSA is executed on the heterogeneous platform when E5-2620 CPUs and Phi-7110 co-processors are working together.
We have also measure the performance of BGSA when using the gcc compiler version (v5.4.0). The results are shown in Fig S9 in comparison to the icc-compiled version. We further provide the performance ratio between gcc and icc in Table 2 for convenience. For the (0, −1, −1) scoring scheme, the icc compiler outperforms the gcc compiler. For the (2, −3, −5) scheme, icc and gcc gain similar performance on Skylake architectures. For KNL, icc outperforms gcc for the AVX512 version of BGSA with a significant performance enhancement of 22%.
Furthermore, we have measured the performance of BGSA for semi-global alignment in comparison to global-alignment in Fig S10. For convenience, we present the performance ratio between semi-global and global alignment in Table 3 . For the (2, −3, −5) scheme, we can observe that semi-global alignment performance is almost identical to global-alignment performance. For the (0, −1, −1) scheme, however, the global-alignment performance is slightly higher on all tested platforms.
Performance using long reads
We also evaluate the performance of BGSA when dealing with long sequence. We have used PBSim [7] to generate a set of simulated PacBio reads with 10000 bases. Fig S11 shows R Q : R Q :
Figure S11: Overall performance comparison for long sequence. 
Banded Edit Distance Performance for Seed Verification
When verifying the candidate positions in seed-based read mappers, we are most concerned about whether the alignment score of two sequences is less than a error threshold e, rather than the specific score value. In this situation, we don't have to calculate the entire score matrix but only some cells in the matrix. Hyyro proposed a the banded Myers' algorithm [8] which only needs to calculate a diagonal tiling region with width 2e + 1 in the score matrix, thus reducing massive useless calculation. We have implemented the banded Myers' algorithm in BGSA with SSE, AVX2, KNC and AVX-512 intrinsics. Furthermore, we will terminate the calculation early if we find the alignment score is surely to be larger than e in the process of calculation for further improving the performance. We have compared the BGSA banded Myers' implementation with the implementations in FEM [9] , Razers3 [10] and NVBio [11] , and we also compare it with the original Myers' implementation in BGSA. For a read set, we have extracted 1 million reads from a real dataset from NCBI SRA (accession number SRR622458) with the length of 101bp. And we use FEM to generate the candidate matching positions in the human genome(hg19). We choose e = 4 and e = 7 in our testing, and we still measure the performance with TCPUS(Billion Cell Updates Per Second ). Although banded Myers' algorithm doesn't calculate all the cells in the score matrix, we still use the total number of matrix cells when calculating TCUPS in order to more clearly demonstrate the performance comparison between the banded and non-banded Myers' algorithm. Figure S12 show the comparison results. BGSA achieves a speedup of 2.2 and 7.1 compared to FEM SSE and Razers3 SSE with SSE intrinsics on E5-2620. On a W-2123 processor, BGSA outperforms FEM SSE and Razers3 SSE by average factor of 4.5 and 14.5 with AVX2 intrinsics. BGSA achieves speedups of 2.0 and 10.0 on a Phi-7110 with KNC intrinsics and 4.5 and 23.0 on a Phi-7210 with AVX-512 intrinsics, respectively, compared to FEM-SSE and Razers3 SSE on a Phi-7210. BGSA also achieves an average speedup of 4.3 on a Phi-7210 with AVX-512 intrinsics compared to the highly optimized banded Myers' implementation of the NVBio library on a Titan X GPU. Besides, compared to the original Myers' implementation in BGSA, banded Myers' implementation achieves a speedup of 12.3, 6.0, 6.6 and 8.0 on E5-2620, W-2123, Phi-7110 and Phi-7210, respectively.
