Abstract-A new Application-Specific Instruction-set Processor (ASIP) architecture for biological sequence alignment is proposed in this manuscript. This architecture achieves high processing throughputs by exploiting both fine and coarse-grained parallelism. The former is achieved by extending the Instruction Set Architecture (ISA) of a synthesizable processor to include multiple specialized SIMD instructions that implement vector-vector and vector-scalar arithmetic, logic, load/store and control operations. Coarse-grained parallelism is achieved by using multiple cores to cooperatively align multiple sequences in a shared memory architecture, comprising proper hardware-specific synchronization mechanisms. To ease the programming, a compilation framework based on an adaptation of the GCC back-end was also implemented. The proposed system was prototyped and evaluated on a Xilinx Virtex-7 FPGA, achieving a 200 MHz working frequency. A sequential implementation and a state-of-the-art SIMD implementation of the Smith-Waterman algorithm were programmed on both the proposed ASIP and an Intel Core i7 processor. When comparing the achieved speedups, it was observed that the proposed ISA achieves a 40x speedup, which contrasts with the 11x speedup provided by SSE2 in the Intel Core i7 processor. The scalability of the multi-core system was also evaluated and proved to scale almost linearly with the number of cores.
I. INTRODUCTION
Bioinformatics applications represent one class of algorithms with particularly high performance and efficiency requirements. Among those, protein and Deoxyribonucleic acid (DNA) sequence alignment algorithms, whose optimal solutions are usually obtained by using Dynamic Programming (DP) methods, tend to exhibit long runtimes when executed on current General Purpose Processors (GPPs).
The Smith-Waterman (SW) algorithm [1], characterized by an O(nm) time complexity, is a widely established DP algorithm to obtain the local alignment between a query sequence (q) and a database sequence (d), of sizes m and n respectively. It operates in two distinct phases: it starts by filling a score matrix H, followed by a traceback phase over this matrix. The matrix is filled by using an affine gap penalty model [2] (see eq. 1), where α and β represent the cost of gap opening and extension, and Sbc(q 
\begin{equation}
\begin{aligned}
F(i,j) &= \max\{\,H(i-1,j)-\alpha,\; F(i-1,j)-\beta\,\} \\
E(i,j) &= \max\{\,H(i,j-1)-\alpha,\; E(i,j-1)-\beta\,\}
\end{aligned}
\tag{1}
\end{equation}
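The affine gap recurrences for F and E can be illustrated with a minimal scalar sketch of the score-matrix filling phase. The update of H used here is the standard Smith-Waterman one (maximum of zero, E, F and the diagonal score plus the substitution score); it is stated as an assumption, since it is not legible in the text above. The names `sw_score` and `match_mismatch`, as well as the scoring values, are hypothetical and chosen only for illustration. The traceback phase is omitted.

```python
def sw_score(q, d, alpha, beta, sub):
    """Best local alignment score of query q against database d.

    alpha: gap opening cost, beta: gap extension cost,
    sub(a, b): substitution score for symbols a and b.
    """
    m, n = len(q), len(d)
    # Row 0 and column 0 form the zero boundary of the three DP matrices.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    E = [[0] * (n + 1) for _ in range(m + 1)]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Affine gap recurrences for F and E.
            F[i][j] = max(H[i - 1][j] - alpha, F[i - 1][j] - beta)
            E[i][j] = max(H[i][j - 1] - alpha, E[i][j - 1] - beta)
            # Assumed standard SW update of H (local alignment, floor at 0).
            H[i][j] = max(0, E[i][j], F[i][j],
                          H[i - 1][j - 1] + sub(q[i - 1], d[j - 1]))
            best = max(best, H[i][j])
    return best


def match_mismatch(a, b):
    """Hypothetical substitution score: +2 for a match, -1 otherwise."""
    return 2 if a == b else -1
```

The two nested loops make the O(nm) cost explicit; the SIMD scheme discussed next keeps the same recurrences but processes several values of i at once.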
To speed up the alignment, several hardware accelerators were developed [4], but such solutions lack the adaptability and flexibility provided by GPPs and by Application Specific Instruction-set Processors (ASIPs). On the other hand, various SIMD parallelization approaches on GPPs have also been presented [3]. One of the fastest was proposed by M. Farrar [3], who adopted a pre-computed query profile for the entire database sequence, and optimized the processing scheme by using a striped access pattern, where the computations are carried out in several separate stripes that cover different parts of the query sequence (see Fig. 1(a)).
Fig. 1. (a) Striped processing scheme; (b) data dependencies [3]. The first five SIMD iterations were numbered and represented with different gray levels (for simplicity, only 4 data elements are shown in each SIMD register).
978-1-4799-0493-8/13/$31.00 © 2013 IEEE ASAP 2013
The query is divided into p equal length segments of size t = ⌊(m + p − 1)/p⌋, where p denotes the number of vector elements that can be accommodated in a SIMD register (each SIMD vector element is assigned to one distinct segment). Each matrix column, corresponding to a database symbol d[j], is processed in t iterations, where each iteration simultaneously processes p query symbols, separated by t−1 lines in the score matrix. Fig. 1(b) illustrates the data dependencies between the last segment and the first segment of the next column of the score matrix H. With this processing pattern, it is possible to move the conditional statements related to the commitment of the vertical dependencies to an independent lazy loop, executed outside the inner loop, where they have to be considered only once, before starting the processing of the next database symbol, thus reducing the impact of the vertical dependencies.
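The striped assignment of query symbols to SIMD lanes can be made concrete with a short sketch. It only computes which query index each lane handles in each of the t iterations of one column; the function name `striped_layout` and the use of `None` to mark padding positions (when m is not a multiple of p) are assumptions made for illustration.

```python
def striped_layout(m, p):
    """Return (t, rows) for a query of length m and p SIMD lanes.

    rows[k][lane] is the query index processed by `lane` in iteration k,
    or None when that position is padding beyond the end of the query.
    """
    t = (m + p - 1) // p  # segment length, i.e. floor((m + p - 1) / p)
    rows = []
    for k in range(t):
        row = []
        for lane in range(p):
            idx = lane * t + k  # lane `lane` owns segment [lane*t, lane*t + t)
            row.append(idx if idx < m else None)
        rows.append(row)
    return t, rows
```

For example, with m = 8 and p = 4 the first iteration touches query indices 0, 2, 4 and 6, and the second touches 1, 3, 5 and 7, so each lane walks through its own segment of the query.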
To further accelerate this class of algorithms, a new ASIP, specifically adapted for biological sequence alignment algorithms, is proposed in this manuscript. The attained processing throughput results from a two-fold improvement over the original architecture: i) extension of the processor ISA to support multiple specialized SIMD vector instructions, to extensively exploit fine-grained parallelism; and ii) implementation of an extensive multi-core computational structure, composed of multiple instantiations of the designed ASIP, in order to efficiently exploit coarse-grained parallelism.
II. DEDICATED SIMD INSTRUCTION SET FOR BIOLOGICAL SEQUENCES ALIGNMENT
Due to its higher performance and prevalence in most widely established bioinformatics applications, Farrar's SIMD implementation [3] is adopted herein as the case study. By analyzing the algorithm's pseudo-code (see Fig. 2(a)), it is clear that the adoption of vector arithmetic instructions will potentially accelerate this algorithm. These instructions should not only speed up the operations between vectors, but they might also facilitate the several operations between vectors and scalars, which are particularly useful when subtracting the gap penalties. The shifting of the F and H vectors can also be efficiently implemented with a vector element shift instruction. Since all these new instructions will be dealing with SIMD vectors, it is also advantageous to include new memory access instructions, to handle vector-sized variables.
Special attention should also be devoted to the definition of optimized control instructions. This effort is justified by the significant predominance of loop procedures in these DP algorithms (generally implemented with conditional branch instructions), as well as the severe penalties that these control instructions generally impose on deep pipeline architectures. In particular, a new specialized branch instruction that simultaneously asserts a branch condition in all vector elements, without any additional processing, will significantly increase the achieved performance (e.g., in the execution of the lazy loop).
Although not limited in this respect, the proposed instruction set and the corresponding data-path (see section III) provide support for the same register and vector-element sizes as Intel SSE2 (used by Farrar [3]), i.e. 128-bit registers, with 8 or 16 elements. Furthermore, the vector elements of each register can take any size, starting from 8 bits to the limit imposed by the register size. However, to obtain a fair comparison with Farrar's [3] SSE2 implementation, only 128-bit registers with 8-bit vector elements will be considered in this particular case study. On the other hand, any non-SIMD instruction will only operate over the least-significant part of the register (corresponding to a scalar word). Such a solution confines the critical path to the non-SIMD data-path, thus making it independent of the extended SIMD register size.
The proposed ISA extension defines 48 specialized SIMD instructions for arithmetic, logic, memory access and control operations. By comparing the implementation of Farrar's algorithm [3] based on the Intel SSE2 ISA (see Fig. 2(b)) with an implementation based on the proposed ISA (see Fig. 2(c)), it can be observed that the proposed ISA immediately reduces the number of instructions, with the most visible advantages in the lazy loop. The major contributor to this reduction is the new set of vectorized control instructions, which significantly reduces the control overhead.
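To illustrate where the control overhead comes from, the sketch below emulates, on plain Python lists of unsigned bytes, the SSE2 sequence that decides whether the lazy loop must continue (saturating subtractions, byte-wise compare against zero, mask extraction, scalar compare and branch), and contrasts it with a single predicate evaluated over all vector elements. The helper names follow the SSE2 mnemonics, whose semantics are well documented. The second function is only an assumed reading of what a vectorized branch provides: the exact semantics of the ASIP instructions, including whether its subtractions saturate, are not specified here and are modelled after the SSE2 behaviour.

```python
def psubusb(a, b):
    """Byte-wise unsigned saturating subtraction (SSE2 psubusb semantics)."""
    return [max(x - y, 0) for x, y in zip(a, b)]


def pcmpeqb(a, b):
    """Byte-wise equality: 0xFF where equal, 0x00 elsewhere (SSE2 pcmpeqb)."""
    return [0xFF if x == y else 0x00 for x, y in zip(a, b)]


def pmovmskb(a):
    """Collect the most significant bit of every byte into an integer mask."""
    mask = 0
    for lane, x in enumerate(a):
        mask |= ((x >> 7) & 1) << lane
    return mask


def lazy_loop_continues_sse2(vH, vF, gap_open):
    """SSE2-style test: several instructions before the branch is resolved."""
    vT = psubusb(vH, [gap_open] * len(vH))   # vT = vH - gapOpen
    vT = psubusb(vF, vT)                     # vT = vF - vT
    is_zero = pcmpeqb(vT, [0] * len(vT))     # 0xFF in lanes where vT == 0
    all_lanes = (1 << len(vT)) - 1           # 0xFFFF for 16 lanes
    return pmovmskb(is_zero) != all_lanes    # cmp + jne


def lazy_loop_continues_vector_branch(vH, vF, gap_open):
    """Assumed semantics of one vector branch: taken if any element is > 0."""
    vT = psubusb(vH, [gap_open] * len(vH))   # vector-scalar subtraction
    vT = psubusb(vF, vT)                     # vector-vector subtraction
    return any(x > 0 for x in vT)            # single vectorized branch test
```

Both functions return the same decision; the difference lies in how many instructions are needed to reach it, which is the effect described in the paragraph above.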
Another significant advantage of the proposed ASIP is that it adopts a strict Reduced Instruction Set Computer (RISC) paradigm based on a shallow pipeline structure, contrasting with Intel's Complex Instruction Set Computer (CISC) model. As a consequence, the observed difference in the number of instructions, together with the RISC single-cycle per instruction ratio (instead of the CISC multiple-cycle per instruction ratio), will significantly increase the processing gain, as will be demonstrated in section V.
(a) Pseudo-code [3]     (b) Intel SSE2                (c) ASIP
                        movdqa   xmm1,xmm2            lv     r4, r6, r10
vT = vH - gapOpen       psubusb  xmm4,xmm2            rsubvs r3, r7, r4
                        movdqa   xmm0,xmm11
vT = vF - vT            psubusb  xmm2,xmm11           rsubvv r3, r3, r5
                        movdqa   xmm11,xmm2
                        pcmpeqb  xmm6,xmm2
                        pmovmskb xmm2,r8d
                        cmp      $0xffff,r8d
                        jne      83e <start+0x1ca>    bgtiv  r3, -68
Fig. 2. (a) Pseudo-code [3]; (b) Intel SSE2 implementation; (c) proposed ASIP implementation.
III. SIMD PROCESSOR ARCHITECTURE
The MB-LITE [5] soft-core was used as the base architecture for the implementation of the proposed ISA, not only due to its simple and portable processing structure, but also because it is a compliant implementation of the well known MicroBlaze ISA. Furthermore, since the GNU Compiler Collection (GCC) already supports the MicroBlaze processor, adding the new instructions' mnemonics and opcodes was easily accomplished by conveniently adapting the corresponding back-end.
The MB-LITE design is highly configurable and is relatively easy to adapt to the proposed ISA. Accordingly, some groups of instructions were left out, including the multiplication and barrel shifter operations, as well as all floating point and special register operations. The reduced hardware resources required by this core were also taken into account, since they lay the groundwork for a scalable multi-core processing platform to exploit coarse-grained parallelism.
Despite being fully parameterizable, the configuration of the designed SIMD module that was adopted for this specific case-study uses 128-bit registers with 16 8-bit SIMD elements. To support the proposed extension of the ISA, the execution unit had to be modified, by extending its original Arithmetic and Logic Unit (ALU) to include a new SIMD module. As an example, the addition and subtraction operations require one adder per SIMD vector element, together with some extra multiplexing logic. Since different types of SIMD operations are supported (vector-vector, vector-scalar and inner-vector), the required vector elements have to be selected from the corresponding registers and only then does the execution unit perform all the parallel arithmetic operations. The results are then chosen based on appropriate control signals.
The new maximum instruction, particularly useful for this class of algorithms, deserved special attention. It was based on the already existing compare instruction, comprising a subtraction followed by a sign evaluation. Therefore, the same logic can be used to implement these two instructions, requiring only a multiplexer to choose the maximum between the two operands. To avoid increasing the critical path, the decision logic was moved to the next pipeline stage and to the pipeline forwarding lines. This new maximum instruction not only replaces one compare and one branch instruction, but it also prevents the pipeline flush (saving 3 or 4 clock cycles, depending on whether the branch has delay slots or not).
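The idea of deriving a maximum from the compare logic can be sketched as follows: one subtraction, a test of the sign of the difference, and a two-input selection that plays the role of the multiplexer. This is only a behavioural illustration with hypothetical function names; Python integers are unbounded, so the overflow handling that a fixed-width hardware comparator requires is not modelled.

```python
def max_via_subtract(a, b):
    """Select the larger operand from the sign of a - b, without a branch
    on the operands themselves (behavioural model of compare + multiplexer)."""
    diff = a - b                 # the same subtraction a compare performs
    a_is_smaller = diff < 0      # sign evaluation of the difference
    return b if a_is_smaller else a   # multiplexer choosing the maximum


def vmax(va, vb):
    """Element-wise maximum of two vectors, one selection per SIMD element."""
    return [max_via_subtract(x, y) for x, y in zip(va, vb)]
```

In the recurrences of the SW algorithm, every `max` between two vectors can then be served by one such operation instead of a compare followed by a conditional branch.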
Whenever possible, the new SIMD instructions were assigned the same opcodes as their non-SIMD counterparts, using otherwise unused instruction fields to distinguish them in the processor control unit. With this option, it was possible to re-use most of the original decoding structures, except for a few control signals that had to be generated from such bit-fields.
IV. MULTI-CORE PROCESSING PLATFORM
Many High Throughput Short Read (HTSR) sequencing applications require the alignment of multiple query sequences to one or more database sequences. This requirement adds thread-level parallelism to the computation, where multiple cores concurrently align multiple query sequences with one or more database sequences. To allow this parallel computation, a shared memory is used to store the database and query sequences [6]. The computation is controlled by a master core, which manages the sequence alignment queue and the multiple processing elements. To initiate the sequence alignment, the master core needs to communicate a minimal set of data to the target processing core, which consists of the address (in main memory) and the length of the query and database sequences.
To compute the alignment score for multiple query sequences, the architecture illustrated in Fig. 3 is proposed. It is composed of: a memory element, to store both the biological sequences and the alignment scores; a master core, which is responsible for managing the sequence alignment queue; multiple processing cores based on the specialized SIMD ASIP; and a mutex circuit, to handle core synchronizations. All elements are interconnected by an AMBA 3 AHB-Lite compatible shared bus. To reduce the amount of data that is transferred between the master and the processing cores, the shared memory model studied in [7], specifically developed for this application domain, was adopted.
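The coordination scheme described above can be mimicked in software with a rough sketch, in which threads stand in for the processing cores, a lock stands in for the mutex circuit, and a single string stands in for the shared memory. Each job descriptor carries only the address and length of the query and database sequences, as in the text. This is a simplification under stated assumptions: the function `run_alignments`, the descriptor layout and the `align` callback are hypothetical, and the master core is reduced to the caller that prepares the job queue.

```python
import threading


def run_alignments(memory, jobs, n_workers, align):
    """Align every job using n_workers threads over a shared memory.

    memory: shared sequence storage (a string, for illustration)
    jobs:   list of (q_addr, q_len, d_addr, d_len) descriptors
    align:  function (query, database) -> score
    """
    mutex = threading.Lock()        # stands in for the mutex circuit
    next_job = [0]                  # shared index into the alignment queue
    scores = [None] * len(jobs)     # results are written back to shared storage

    def worker():
        while True:
            with mutex:             # only one core updates the queue at a time
                k = next_job[0]
                if k >= len(jobs):
                    return
                next_job[0] = k + 1
            q_addr, q_len, d_addr, d_len = jobs[k]
            query = memory[q_addr:q_addr + q_len]
            database = memory[d_addr:d_addr + d_len]
            scores[k] = align(query, database)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return scores
```

Only four integers per job cross between the queue and a worker; the sequences themselves are read directly from the shared storage, which is the point of communicating addresses and lengths rather than data.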
V. EXPERIMENTAL RESULTS
To evaluate the proposed multi-core processing framework, a thorough performance analysis of the complete system was conducted, by prototyping it in a Xilinx Virtex-7 FPGA (XC7VX485T). To synthesize the design and perform the place-and-route procedure, the Xilinx ISE 14.4 tool-chain was used. Accurate clock cycle measurements of the time required to execute the biological sequence analysis on the proposed platform were obtained by using Modelsim SE 10.0b.
Table I depicts an evaluation of the resource overhead introduced by the proposed extended ISA and of the attained maximum operating frequencies for both a single-core configuration of the proposed ASIP and for the multi-core system. The extension of the original MB-LITE ISA to support the new instructions led to a frequency decrease of about 27 MHz relative to the original MB-LITE implementation (not shown in the table). This demonstrates that the ISA extension had a reduced impact on the original critical path, which is now limited by the added multiplexing logic that is required to implement the SIMD instructions.
To demonstrate the advantages of the proposed SIMD ISA, the number of clock cycles required to execute a DNA sequence alignment procedure was measured on both the proposed ASIP and a state-of-the-art superscalar GPP, capable of multiple instruction issue, out-of-order and speculative execution (Intel Core i7 950 processor). For this test, both the sequential and Farrar's SIMD versions of the SW algorithm were considered. The sequence alignment code was compiled with GCC 4.6.2, using flags -O2 (sequential case) and -O (SIMD case), corresponding to the most favorable parametrization for each case. On the Intel architecture, cycle-accurate measurements were obtained by using the PAPI library to read the processor performance counters. For the considered benchmark, a DNA data-set was used, which is composed of several database sequences ranging from 128 to 16384 elements and a query sequence of length 64. The database sequences correspond to a random selection of subsequences of the Homo Sapiens chromosome Y genomic contig (NT_011875.12), while the query sequence was generated by randomly combining reads from run ERR004756 of study ERP000053 (human DNA).
Table II presents the average number of clock cycles required to execute the DNA sequence alignment with the proposed ASIP and the Intel Core i7. As can be observed in this table, the Intel Core i7 achieves a maximum speedup of 11.36x, while the ASIP, with the proposed ISA, achieves a maximum speedup of 40.69x, i.e., a value about 3.6 times higher. This result demonstrates that the proposed ISA extension is well tuned for operations commonly adopted in sequence alignment algorithms. Furthermore, it is important to recall that the ASIP SIMD register size was configured to a 128-bit width for the sole purpose of ensuring a fair comparison with the Intel processor, although it may be easily extended in order to increase the ASIP's performance.
To analyze the scalability of the multi-core system, the obtained speedup was measured when all the processing cores are executing Farrar's SIMD version of the sequence alignment algorithm. Fig. 4 presents the obtained speedup values in terms of both the clock cycles and the total processing time of the proposed multi-core structure. Such speedup values were obtained by using a single ASIP core as the reference. The observed speedup increases almost linearly for configurations up to 16 cores. With additional cores, the contention in the shared bus becomes a limiting factor [7], thus reducing the effectiveness of the extra cores and resulting in a sub-linear speedup increase. Still, when considering the initial non-SIMD sequential implementation as reference, the obtained results demonstrate that a 750x processing time speedup can be obtained with a 32-core parallel SIMD implementation of the proposed ASIP. It should be noted that, due to the Block-RAM resource requirements of each core, a maximum of 38 processing cores can be instantiated on the prototyping FPGA.
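As a rough back-of-the-envelope reading of these figures, the sketch below separates the cumulative 750x speedup into a SIMD factor and a multi-core factor. It rests on two assumptions that are not stated in the text: that the two factors compose multiplicatively, and that the maximum single-core SIMD speedup of 40.69x applies to the same workload, ignoring any difference in operating frequency between the single-core and multi-core configurations. The function names are hypothetical and the resulting values are indicative only.

```python
def multicore_factor(total_speedup, simd_speedup):
    """Speedup attributable to the extra cores, assuming the total speedup
    is the product of the SIMD speedup and the multi-core speedup."""
    return total_speedup / simd_speedup


def parallel_efficiency(speedup, cores):
    """Fraction of the ideal (linear) speedup that is actually obtained."""
    return speedup / cores


# Figures reported in the text: 750x cumulative speedup with 32 cores,
# 40.69x maximum single-core SIMD speedup.
factor_32 = multicore_factor(750, 40.69)
efficiency_32 = parallel_efficiency(factor_32, 32)
```

Under these assumptions the 32 cores contribute a factor of roughly 18, i.e. a parallel efficiency below 60%, which is consistent with the sub-linear behaviour attributed to bus contention beyond 16 cores.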
VI. CONCLUSION
A new ASIP architecture, specifically adapted for biological sequence alignment algorithms, was proposed. The presented ASIP is able to achieve high processing throughputs through an optimized architecture that exploits both fine and coarse-grained parallelism. Fine-grained parallelism is achieved by expanding the MicroBlaze ISA to support multiple specialized SIMD instructions, and by conveniently adapting the pipeline architecture of the MB-LITE soft-core. This adaptation provided a speedup of about 40x, when compared to a non-SIMD implementation of the SW algorithm. In contrast, the same SIMD implementation executed on an Intel Core i7 only achieved a speedup of about 11x. Coarse-grained parallelism was also exploited by using a multi-core computational structure composed of multiple ASIPs. A functional prototype on a Xilinx Virtex-7 FPGA device demonstrated that an almost linear speedup can be achieved with up to 16 processing cores. Furthermore, experimental setups using more cores demonstrated that the proposed system is capable of achieving a cumulative speedup of 750x with 32 cores, despite the observable contention in the interconnection bus.
