A novel processing-in-storage (PRinS) architecture based on Resistive CAM (ReCAM) is described and proposed for Smith-Waterman (S-W) sequence alignment. The ReCAM massively-parallel compare operation finds matching base-pairs in a fixed number of cycles, regardless of sequence length. The ReCAM PRinS S-W algorithm is simulated and compared to FPGA, Xeon Phi and GPU-based implementations, showing at least 3.7× higher throughput and at least 15× lower power dissipation.
INTRODUCTION
With the approaching end of Moore's law, academia and industry have an increased interest in non-von Neumann compute paradigms. One example is content addressable associative processing [1]. CMOS-based content addressable memories (CAM) require large bit-cells, limiting chip capacity and forcing most data-intensive applications to employ less functional random access memories. Novel resistive materials dissipate little heat and allow for 3D stacking. Combined with CMOS, resistive materials can be used in a CAM bit-cell, resulting in a small cell area, low leakage power and increased overall chip area efficiency.
This work presents a novel resistive CAM-based storage system architecture with a processing-in-storage (PRinS) compute paradigm. The system is an in-storage accelerator that may scale up to hundreds of millions of processing units (PUs) spread across multiple silicon dies, each containing several million PUs. In addition, the system performs the computations in-situ, resulting in increased performance and reduced energy consumption on massively parallel workloads. We name the system Resistive CAM or ReCAM.
The first part of this paper presents the ReCAM system architecture and describes its main components. The second part demonstrates a PRinS implementation of a key algorithm in bioinformatics, the Smith-Waterman (S-W) DNA local sequence alignment. We also present simulation results and compare the performance of ReCAM with four state-of-the-art large-scale accelerator systems. We show that an in-storage implementation of S-W on ReCAM may achieve on average 3.7× higher throughput while dissipating 15× lower power compared with a 384-GPU implementation, the largest S-W implementation found in the literature.
The rest of this paper is organized as follows: Section 2 presents the architecture of ReCAM-based storage. Section 3 explores the in-storage implementation of S-W. Simulation results are discussed in Section 4. Section 5 presents a discussion on the scalability of a ReCAM storage system, and Section 6 offers conclusions.
RECAM-BASED STORAGE
Resistive memories store information by modulating the resistance of nanoscale storage elements. They are nonvolatile, free of leakage power, and are emerging as long-term potential alternatives to charge-based memories, including NAND flash. The metal-oxide resistive random access memory (ReRAM) is considered a potential technology to replace next-generation nonvolatile memories [2]. Its main features are high reliability and fast access speed. A 32GB test chip with two ReRAM-based memory layers and a CMOS logic layer underneath has been developed [3], demonstrating design techniques to achieve a high-density functional chip.
The ReCAM bit-cell, shown in Figure 1(b), consists of two transistors and two resistive elements (2T2R). The KEY register contains a data word to be written or compared against. The MASK register defines the active columns for write and read operations, enabling bit selectivity. The TAG register (Figure 1(c)) marks the rows that are matched by the compare operation and may be affected by a parallel write. The TAG register enables chaining multiple ReCAM ICs.
ReCAM Crossbar Array
In a conventional CAM, a compare operation is typically followed by a read of the matched data word. When in-storage processing involves arithmetic operations, a compare is usually followed by a parallel write into the unmasked bits of all tagged rows, and additional capabilities, such as read and reduction operations, are included [11].
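The compare-then-write pattern described above can be illustrated with a minimal behavioral sketch. The class name `ToyCAM` and its methods are hypothetical and introduced only for illustration; rows are modeled as Python integers, and timing, circuit behavior and multi-IC chaining are not modeled.

```python
class ToyCAM:
    """Behavioral toy model of a CAM array (illustrative names, not the ReCAM design)."""

    def __init__(self, rows):
        self.rows = list(rows)            # each row is an integer bit-vector
        self.tag = [False] * len(rows)    # one TAG bit per row

    def compare(self, key, mask):
        # Tag every row whose masked bits equal the masked KEY bits.
        self.tag = [(r & mask) == (key & mask) for r in self.rows]

    def write(self, key, mask):
        # Write the masked KEY bits into all tagged rows in parallel.
        self.rows = [(r & ~mask) | (key & mask) if t else r
                     for r, t in zip(self.rows, self.tag)]


cam = ToyCAM([0b0001, 0b0101, 0b0011, 0b0100])
cam.compare(key=0b0001, mask=0b0011)   # match rows whose two low bits are 01
cam.write(key=0b1000, mask=0b1000)     # set bit 3 in the matched rows only
```

In this sketch the MASK selects which bit columns take part in each operation, so a single compare followed by a single write updates every matching row at once, independent of the number of rows.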
Any computational expression can be efficiently implemented in ReCAM storage using line-by-line execution of the truth table of the expression [8]. Arithmetic operations are typically performed bit-serially. Table 1 lists the operations used in the S-W implementation (Section 3) and the number of cycles required for each. Shifting down a consecutive block of rows by one row position requires three cycles per bit. First, compare-to-'1' copies the source bit-column of all rows into the TAG. Second, shift moves the TAG vector down by setting the shift-select line (Figure 1(c [...] rows times 2 for compare and write amount to 512 cycles). Row-wise maximum compares two 32-bit numbers in each row, in parallel across all rows. Max Scalar tags all rows that contain the maximal value in the selected element.
Additional operations, such as parallel and reduction arithmetic, may be required for other algorithms.
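The bit-serial, truth-table style of execution described above can be sketched as follows. This is a minimal behavioral model, not the instruction tables of [8] or [11]: the function name `associative_add` is hypothetical, and the sketch assumes that the sum and carry-out bits are written to columns separate from the inputs, so that a row is never matched twice within one bit position. Under these assumptions the cycle count for 32-bit operands is 8 entries × 2 cycles × 32 bits = 512, which appears consistent with the 512-cycle figure quoted above.

```python
from itertools import product


def associative_add(a_vals, b_vals, width=32):
    """Add a_vals[r] + b_vals[r] in every row, bit-serially, via a truth table.

    Returns (sums modulo 2**width, cycle count). Each truth-table entry costs
    one compare cycle and one write cycle, applied to all rows at once.
    """
    n = len(a_vals)
    s_vals = [0] * n
    carry_in = [0] * n
    cycles = 0
    for k in range(width):                          # least significant bit first
        carry_out = [0] * n
        # Bits (a_k, b_k, carry) currently stored in each row.
        bits = [((a_vals[r] >> k) & 1, (b_vals[r] >> k) & 1, carry_in[r])
                for r in range(n)]
        for entry in product((0, 1), repeat=3):     # 8 truth-table entries
            # Compare: tag the rows whose input bits match this entry.
            tag = [row_bits == entry for row_bits in bits]
            cycles += 1
            # Write: store the entry's sum and carry bits into tagged rows.
            total = sum(entry)
            for r in range(n):
                if tag[r]:
                    s_vals[r] |= (total & 1) << k
                    carry_out[r] = total >> 1
            cycles += 1
        carry_in = carry_out
    return s_vals, cycles


sums, cycles = associative_add([1, 7, 2**32 - 1], [2, 9, 1])
```

The cycle count depends only on the operand width, not on the number of rows, which is the property that makes the approach attractive for very large arrays.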
System Architecture
Conceptually, the ReCAM comprises hundreds of millions of rows, each serving as a computational unit. The entire array may be divided into multiple smaller ICs (due to power per die restrictions, Figure 2(a)), which use the same MASK and KEY. A row is fully contained within an IC. All ICs are daisy-chained for Shift and Max Scalar operations. Therefore, in practice, the operations listed in Table 1 take several more cycles to enable inter-IC shift operations. The ReCAM storage system uses a microcontroller (Figure 2(b)) similar to [12]. It issues instructions, sets the key and mask registers, handles control sequences and executes read requests. In addition, the microcontroller holds the associative instructions buffer, containing the truth tables for associative instructions. Since instructions are performed bit-serially, these tables are typically small, as evident in Figure 2(b) [11]. Part of the associative instructions buffer is user-programmable with custom instructions, such as the match operation in Table 1.
RECAM BASED IN-STORAGE SMITH-WATERMAN IMPLEMENTATION

Smith-Waterman Algorithm
S-W identifies the optimal local alignment of two sequences by computing a two-dimensional scoring matrix $H$. Each element $H_{i,j}$ is calculated according to Eq. (3). $s(a_i, b_j)$ is the match score between the base-pairs in row $i$ (the $i$-th element of sequence A) and column $j$ (the $j$-th element of sequence B). Matching base-pairs score positively (e.g., +2), while mismatching base-pairs result in a negative score (e.g., -1). The optimal alignment score between two sequences is the highest score in the matrix $H$.
The alignment may contain gaps in both sequences, which are penalized in the score calculation (by negative scores). According to the affine gap model [5], opening a gap is harder than extending it; therefore the penalty for opening a gap is larger. The affine penalty scheme is calculated with two additional matrices, $E$ and $F$, equations (1) and (2); $G_{first}$ and $G_{ext}$ are the penalties for starting and extending a gap, respectively. The matrices $H$, $E$ and $F$ are initialized with 0: $H_{0,j} = H_{i,0} = 0$, $E_{0,j} = E_{i,0} = 0$, $F_{0,j} = F_{i,0} = 0$ for all $i$ and $j$.
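The recurrences referenced as Eq. (1)-(3) do not appear in this text. For orientation, the standard affine-gap form of the Smith-Waterman recurrences is sketched below. The symbol names ($H$, $E$, $F$, $G_{first}$, $G_{ext}$, $s$), the assignment of $E$ and $F$ to the two gap directions, and the placement of the labels (i) and (ii) are assumptions chosen to agree with the surrounding description; they are not a verbatim copy of the original equations.

```latex
\begin{align}
E_{i,j} &= \max\{\, \underbrace{E_{i,j-1} - G_{ext}}_{\text{(i)}},\;
                    \underbrace{H_{i,j-1} - G_{first}}_{\text{(ii)}} \,\} \tag{1}\\
F_{i,j} &= \max\{\, \underbrace{F_{i-1,j} - G_{ext}}_{\text{(i)}},\;
                    \underbrace{H_{i-1,j} - G_{first}}_{\text{(ii)}} \,\} \tag{2}\\
H_{i,j} &= \max\{\, 0,\; E_{i,j},\; F_{i,j},\; H_{i-1,j-1} + s(a_i, b_j) \,\} \tag{3}
\end{align}
```

In this form, the cells $H_{i,j-1}$ and $H_{i-1,j}$ used in the (ii) terms both lie on the antidiagonal immediately preceding that of $H_{i,j}$, while $H_{i-1,j-1}$ lies two antidiagonals back.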
Filling the scoring matrix is the computationally intensive part of S-W. In a sequential implementation of the algorithm, cell filling is performed in either row- or column-wise order. A parallel implementation allows all independent cells to be computed in the same iteration. Such cells reside on the same antidiagonal. The matrix is therefore filled one antidiagonal at a time, progressing along the main diagonal, as illustrated in Figure 3.
The sequential time complexity is $O(mn)$, where $m$ and $n$ are the respective lengths of the sequences. Parallel time complexity on $p$ parallel processing units is $O(mn/p)$. In ReCAM, the processing unit is a memory row. Since ReCAM may comprise hundreds of millions of rows, unlike GPU or FPGA implementations, $p$ could possibly be larger than $\max\{m, n\}$ even for very large $m$ and $n$. Hence, ReCAM can achieve linear time complexity of $O(\max\{m, n\})$.
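The complexity argument can be checked on a small example. The sketch below counts update steps under the simplifying assumption that $p$ processing units work on one antidiagonal at a time and that every cell update costs one step; the function names are hypothetical.

```python
def antidiagonal_lengths(m, n):
    """Length of each antidiagonal of an m-by-n scoring matrix, in fill order."""
    return [min(k, m, n, m + n - k) for k in range(1, m + n)]


def parallel_steps(m, n, p):
    """Steps needed when p processing units update one antidiagonal at a time."""
    # -(-x // p) is integer ceiling division.
    return sum(-(-length // p) for length in antidiagonal_lengths(m, n))


m, n = 6, 4
lengths = antidiagonal_lengths(m, n)        # grows, stays constant, then shrinks
sequential = m * n                          # one cell per step
limited = parallel_steps(m, n, p=2)         # fewer units than the longest antidiagonal
saturated = parallel_steps(m, n, p=4)       # enough units for every antidiagonal
print(lengths, sequential, limited, saturated)
```

Once $p$ is at least the length of the longest antidiagonal, each antidiagonal takes a single step and the step count equals the number of antidiagonals, which grows linearly with the sequence lengths.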
ReCAM Implementation of S-W
In this work we focus on finding the maximal alignment score. Therefore, storing the entire matrix in memory is not needed. This is in contrast to the full algorithm, which also contains the traceback part for finding the alignment [4].
A total of four antidiagonals is required to compute a new antidiagonal of $H$: two of $H$ (see Eq. (1)-(3) and the green and red antidiagonals in Figure 3), one of $E$ (see Eq. (1)) and one of $F$ (see Eq. (2)). Thus, five matrix antidiagonals are stored in the ReCAM in each iteration (E, F, AD[0], AD[1] and AD[2] in Figure 3 and Figure 5). A tmp field stores partial results. The overall space complexity required for executing the algorithm is therefore $O(\min\{m, n\})$.
Each of the five antidiagonals is mapped onto a 32-bit column in the ReCAM. Every ReCAM row retains one element of the vectors seqA, seqB, E, F, AD[0], AD[1], AD[2] and tmp. The first two fields are the 2-bit elements of sequences A and B, respectively.
The S-W algorithm implementation on ReCAM can be divided into three logical sections. The first section, marked 1 in Figure 3, starts at the top-left cell and covers a triangle with each edge of length $\min\{m, n\}$ cells. In it, each newly scored antidiagonal is one cell longer than the previous one. The third section (3 in Figure 3) has a similar shape and the same dimensions, ending at the bottom-right cell. In it, every newly scored antidiagonal is one cell shorter than the previous one. The second section (2 in Figure 3) is a parallelogram between the first and third sections. In it, all antidiagonals are of the same length.
Figure 4 presents the pseudocode of the S-W score finding on ReCAM. Three ReCAM columns are required to store the last two scored antidiagonals of $H$ and the presently computed one, denoted AD[2]-AD[0] in the code. During execution, these columns are cyclically buffered; the oldest scores are replaced by the new ones (line 4 in Figure 4). Three additional 32-bit columns are used to store the antidiagonals of $E$ and $F$, and tmp.
Figure 5 shows a ReCAM crossbar snapshot at the beginning (a) and the end (b) of a single iteration of the pseudocode in Figure 4. After calculating the matching score (line 7), AD[left_AD] is no longer required and is therefore used to store temporary results. Next, the max between the match score and zero is calculated (line 8). Note that the terms marked (ii) in equations (1) and (2) belong to the same antidiagonal; it is therefore enough to calculate (ii) once for both E and F (line 9). Lines 10-16 compute equations (1)-(3). In line 15, after E is calculated, its columns are shifted one row down to have the values of E aligned with the appropriate ones in AD[right_AD]. In sections 1 and 2 of Figure 3, the down-shifted columns require zero-padding of the top-most ReCAM row (not shown in Figure 4). At the end (line 17), the global max is updated with the maximal cell score.
After a specific base-pair of the shifted sequence has been aligned with all base-pairs of the other sequence, it is cyclically shifted to its original position (not displayed in Figure 4).
The total number of iterations is the sum of lengths of the two sequences. 
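The antidiagonal-by-antidiagonal scoring described in this section can be illustrated in software. The sketch below is not the pseudocode of Figure 4: it is a sequential Python model in which the inner loop stands in for the row-parallel ReCAM operations, fresh lists stand in for the cyclically reused columns, and the list index plays the role of the ReCAM row. The function names, the default scores and the gap penalties (`g_first=3`, `g_ext=1`) are assumptions for illustration, and the recurrences are the standard affine-gap form with zero boundary values. A plain row-wise version is included only as a cross-check.

```python
def sw_score_antidiagonal(seq_a, seq_b, match=2, mismatch=-1, g_first=3, g_ext=1):
    """Maximal local alignment score, filling one antidiagonal per iteration."""
    m, n = len(seq_a), len(seq_b)
    zeros = [0] * (m + 1)                   # index i plays the role of a row
    h_prev2, h_prev = zeros[:], zeros[:]    # H on antidiagonals d-2 and d-1
    e_prev, f_prev = zeros[:], zeros[:]     # E and F on antidiagonal d-1
    best = 0
    for d in range(2, m + n + 1):           # cells (i, j) with i + j == d
        h_new, e_new, f_new = zeros[:], zeros[:], zeros[:]
        # All cells of one antidiagonal are independent of each other.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            e_new[i] = max(e_prev[i] - g_ext, h_prev[i] - g_first)
            # Index i - 1 corresponds to the one-row shift between antidiagonals.
            f_new[i] = max(f_prev[i - 1] - g_ext, h_prev[i - 1] - g_first)
            h_new[i] = max(0, e_new[i], f_new[i], h_prev2[i - 1] + s)
        best = max(best, max(h_new))        # running global maximum
        h_prev2, h_prev, e_prev, f_prev = h_prev, h_new, e_new, f_new
    return best


def sw_score_rowwise(seq_a, seq_b, match=2, mismatch=-1, g_first=3, g_ext=1):
    """Reference row-wise fill that stores the full matrices."""
    m, n = len(seq_a), len(seq_b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    E = [[0] * (n + 1) for _ in range(m + 1)]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            E[i][j] = max(E[i][j - 1] - g_ext, H[i][j - 1] - g_first)
            F[i][j] = max(F[i - 1][j] - g_ext, H[i - 1][j] - g_first)
            H[i][j] = max(0, E[i][j], F[i][j], H[i - 1][j - 1] + s)
            best = max(best, H[i][j])
    return best


print(sw_score_antidiagonal("ACGTTGCA", "ACGTGCA"))
```

The antidiagonal version keeps only a handful of vectors alive at any time, which mirrors the point made above that the full matrix need not be stored when only the maximal score is required.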
SIMULATION
The S-W algorithm is simulated on ReCAM using the cycle-accurate simulator introduced in [8], employing ReCAM performance and power figures obtained by SPICE simulations. The simulated ReCAM parameters are listed in Table 2. The power figure was taken from [8].
The simulation employs sequence data retrieved from the National Center for Biotechnology Information (NCBI), comparing human (GRCh37) and chimpanzee (panTro4) homologous chromosomes, similar to [10]. The CUPS metric (Cell Updates per Second) is used to measure S-W performance. Results are compared to other works in Table 3. A multi-GPU implementation [10] reached 11.1 TCUPS on a cluster of 128 compute nodes with a total of 384 Tesla M2090 GPUs. An FPGA implementation of S-W reaches 6.0 TCUPS on the RIVYERA platform [6], which has 128 Xilinx Spartan-6 LX150 FPGAs. A four-Xeon-Phi implementation achieves 0.23 TCUPS [7]. On ReCAM, we demonstrate 53 TCUPS, computing a total of 57.2×10^12 scores. The table also shows computed GCUPS/Watt ratios; ReCAM is close to twice as efficient as the FPGA solution and 80× more efficient than the GPU system.
Figure 5 (caption fragment): AD[2] contents are being replaced with the new result. Bottom rows in a crossbar IC are daisy-chained to the next IC in a shift instruction.
The simulated ReCAM power dissipation is 6.6kW. The optimal setting to sustain this power figure with minimal performance overhead is dividing the ReCAM into 32 separate ICs, each with 256MB and 8M rows. The multi-GPU implementation using 384 Tesla M2090 GPUs and 256 Intel Xeon E5-2670 CPUs might dissipate 100kW, 15× higher power. Table 4 shows additional comparisons of ReCAM and the multi-GPU cluster [10], demonstrating up to 3.7× faster execution on ReCAM.
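Some of the figures quoted above can be related to each other with simple arithmetic. The sketch below uses only numbers stated in this section; the variable names are illustrative, binary prefixes are assumed for "8M" and "256MB", and the runtime is merely what the quoted throughput would imply for the quoted number of cell updates, not a reported measurement.

```python
cells = 57.2e12              # total scores computed
recam_cups = 53e12           # ReCAM throughput, 53 TCUPS
recam_power_w = 6.6e3        # simulated ReCAM power, 6.6 kW
gpu_power_w = 100e3          # estimated multi-GPU cluster power, 100 kW

implied_runtime_s = cells / recam_cups                    # seconds at 53 TCUPS
recam_gcups_per_watt = recam_cups / 1e9 / recam_power_w   # energy efficiency
power_ratio = gpu_power_w / recam_power_w                 # GPU vs. ReCAM power

num_ics = 32
rows_per_ic = 8 * 2**20            # 8M rows per IC (binary prefix assumed)
bytes_per_ic = 256 * 2**20         # 256MB per IC (binary prefix assumed)
total_rows = num_ics * rows_per_ic
bits_per_row = 8 * bytes_per_ic // rows_per_ic

print(f"implied runtime: {implied_runtime_s:.2f} s")
print(f"ReCAM efficiency: {recam_gcups_per_watt:.2f} GCUPS/W")
print(f"power ratio: {power_ratio:.1f}x")
print(f"total rows: {total_rows}, bits per row: {bits_per_row}")
```

Under these assumptions the 32-IC configuration provides roughly 268 million rows of 256 bits each, in line with the "hundreds of millions of rows" scale stated earlier, and the 100kW to 6.6kW ratio reproduces the quoted 15× power advantage.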
SCALABILITY OF SEQUENCE ALIGNMENT ON RECAM
Consider the case of one billion organism sequences. Each sequence is hundreds of millions of base-pairs in size on average. Analyzing the contents of these sequences can lead to discoveries such as identification of disease-carrying genes, determination of evolutionary events and identification of regions that can be used to silence genes [9]. Performing an all-to-all alignment of the entire sequence database in a conventional data-center is not scalable. Every pair of sequences must be fetched into main memory, close to the processing unit (CPU or accelerator). The high communication cost between separate storage units causes the system to be I/O bound in an all-to-all type of computation.
On the other hand, ReCAM-based storage is more scalable. Its inherent parallelism allows for scalability when adding more ICs, increasing storage capacity at no performance cost. The compute capability is linearly scalable in the number of ICs. Therefore, performing an all-to-all alignment of large sets, such as one billion sequences, does not require external communication for the ReCAM, in contrast to datacenter-scale storage. A more effective solution, in terms of performance and energy, is using ReCAM as primary storage when large alignment operations are constantly performed.
CONCLUSIONS
This paper explores a PRinS (PRocessing in Storage) implementation of the scoring step of the Smith-Waterman DNA sequence alignment algorithm on a novel solid-state storage device, based on Resistive Content Addressable Memory (ReCAM). ReCAM enables storage with in-situ processing capabilities. It can contain hundreds of millions of data rows, each serving as a processing unit. The proposed ReCAM system is divided into multiple ICs to accommodate power density constraints.
The sheer number of database searches on whole genomes creates a need for considerably higher performance than exists today. For example, aligning two very long sequences, such as complete human and chimpanzee chromosomes, is a difficult task for contemporary accelerators. Since the ReCAM contains hundreds of millions of PUs, its performance increases with input size. We show that ReCAM has the potential to provide 3.7× performance improvement and 15× lower power dissipation over a 384-GPU cluster.
This research can be extended in several ways: First, the ReCAM S-W scoring algorithm can be extended to provide complete DNA sequence alignment (i.e., both matrix-fill and traceback steps), maintaining the same performance and power advantages. Second, the ReCAM algorithm can be applied in parallel to complete DNA sequences of two organisms, and not only to specific chromosomes. Third, the proposed S-W ReCAM algorithm can be applied to the wider challenge of aligning protein sequences. That problem is more challenging than DNA alignment because the required substitution matrix is typically 20×20 rather than 4×4, and the ReCAM could store the entire substitution matrix, resulting in efficient parallel processing.
ReCAM architecture, capable of general purpose associative processing, can also be applied to other challenging problems, such as machine learning and graph algorithms.
ACKNOWLEDGMENT
This work was partially funded by the Intel Collaborative Research Institute for Computational Intelligence.
