PANDA: Processing-in-MRAM Accelerated De Bruijn Graph based DNA Assembly by Angizi, Shaahin et al.
PANDA: Processing-in-MRAM Accelerated
De Bruijn Graph based DNA Assembly
Shaahin Angizi†, Naima Ahmed Fahmi∗, Wei Zhang∗ and Deliang Fan†
†School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85287
∗Department of Computer Science, University of Central Florida, Orlando, FL 32816
sangizi@asu.edu, fnaima@knights.ucf.edu, wzhang.cs@ucf.edu, dfan@asu.edu
Abstract—Spurred by the widening gap between data processing
speed and data communication speed in von Neumann
computing architectures, some bioinformatics applications have
harnessed the computational power of Processing-in-Memory
(PIM) platforms. However, the performance of PIM platforms unavoidably
diminishes when dealing with complex applications that require
bulk bit-wise comparison or addition operations. In this work,
we present PANDA, an efficient Processing-in-MRAM Accelerated De
Bruijn Graph based DNA Assembly platform built on an optimized
and hardware-friendly genome assembly algorithm. PANDA is able
to assemble large-scale DNA sequence data sets from all-pair
overlaps. We first design the PANDA platform, which exploits MRAM
as a computational memory and converts it into a potent processing
unit for genome assembly. PANDA can execute not only the efficient
bulk bit-wise X(N)OR-based comparison/addition operations heavily
required for the genome assembly task but also a full set of
2-/3-input logic operations inside the MRAM chip. We then develop
a highly parallel, step-by-step, hardware-friendly DNA assembly
algorithm for PANDA that requires only the developed in-memory
logic operations. The platform is then configured with a novel data
partitioning and mapping technique that provides local storage and
processing to fully utilize the algorithm-level parallelism.
Cross-layer simulation results demonstrate that PANDA reduces the
run time and power by factors of 18 and 11, respectively, compared
with a CPU. In addition, speed-ups of up to 2-4× can be obtained over
recent processing-in-MRAM platforms performing the same task.
Index Terms—Processing-in-Memory, DNA Assembly, SOT-
MRAM.
I. INTRODUCTION
With the advent of high-throughput second-generation
parallel sequencing technologies, fast and accurate generation
of large-scale genomics data has become possible, which is a
significant advancement. Such data can enable us to measure
the molecular activities in cells more accurately by analyz-
ing the genomics activities, including mRNA quantification,
genetic variants detection, and differential gene expression
analysis. Thus, by understanding the transcriptomic diversity,
we can improve phenotype predictions and provide more
accurate disease diagnostics [1]. However, the reconstruction
of the full-length transcripts considering sequencing errors is
a challenging task in terms of computation and time. Since
the current cDNA sequencing technology cannot read whole
genomes in one step [2], the data produced by the sequencer
is extensively fragmented due to the presence of repeated
chunks of sequences, duplicated reads, and large gaps. Thus,
the goal of the genome assembly process is to combine the large
Figure 1. (a) The de Bruijn graph-based genome assembly process, (b) Break
down of execution time of Meraculous genome assembler for human and
wheat data-set [3], [2].
number of fragmented short reads and merge them into long
contiguous pieces of sequence (i.e. contigs), to reconstruct the
original chromosome from which the DNA is originated as
shown in Fig. 1a.
Today’s bioinformatics application acceleration solutions
are mostly based on the von Neumann architecture, with separate
computing and memory components connected via buses,
and inevitably consume a large amount of energy in data
movement between them [4], [5]. In the last two decades,
Processing-in-Memory (PIM) architecture, as a potentially
viable way to solve the memory wall challenge, has been
well explored for different applications [5], [6], [7], [8],
[9], [10], [11]. In particular, processing-in-non-volatile-memory
architecture has achieved remarkable success by dramatically
reducing data transfer energy and latency [12], [13], [14],
[15], [16]. The key concept behind PIM is to realize logic
computation within memory to process data by leveraging the
inherent parallel computing mechanism and exploiting large
internal memory bandwidth. Besides, most of CPU [17]-/ GPU
[18]-/ FPGA [19]- and even PIM [4], [5]-based efforts have
only focused on the DNA short read alignment problem, while
the de novo genome assembly problem still relies mostly on
CPU-based solutions [20]. De novo assemblers are catego-
rized into Overlap Layout Consensus (OLC), greedy, and de
Bruijn graph-based designs. Recently, de Bruijn graph-based
assemblers have gained much more attention as they are able
to solve the problem by finding an Euler path in polynomial time,
rather than finding a Hamiltonian path, an NP-hard problem, as
OLC-based assemblers do [21]. There are multiple CPU-based
genome assemblers implementing the bi-directed de Bruijn
graph model, such as Velvet [22], Trinity [23], etc. However,
arXiv:2008.06177v1 [cs.AR] 14 Aug 2020
only a few GPU-accelerated assemblers have been presented,
such as GPU-Euler [20], [24], [25]. This mainly comes from
the nature of the assembly workload, which is not only
compute-intensive but also extremely data-intensive, requiring very large
working memories. Therefore, adapting such a problem to
GPUs with their limited memory capacities has brought many
challenges [26]. A graph-based genome assembly process,
shown in Fig. 1a, as the main focus of this work, basically
consists of multiple stages, i.e. k-mer analysis for creating a
Hashmap, graph construction and traversal, and scaffolding
and gap closing. Fig. 1b depicts the breakdown of execution
time for the well-known Meraculous assembler [3] for the
human and wheat data sets. We observe that Hashmap and
graph construction/ traversal are the two most expensive
components, which together take over 80% of the total run
time.
This motivates us to show that the genome assembly problem,
and especially its computationally heavy components, can
exploit the large internal bandwidth of a Magnetic Random Access
Memory (MRAM) chip for PIM acceleration. Moreover,
a careful observation of the genome assembly workload shows
that this task relies heavily on comparison and addition
operations. However, due to the intrinsic complexity of
X(N)OR logic, the throughput of processing-in-memory platforms
[12], [13], [5], [27], [28] unavoidably diminishes when
dealing with such bulk bit-wise operations. This is because
these platforms realize X(N)OR through multi-cycle
majority/AND/OR-based operations. In this work,
we explore a highly parallel and PIM-friendly implementation
of de Bruijn graph-based genome assembly that can accelerate
especially the first two stages of the algorithm. Overall, this
paper makes the following contributions:
(1) To the best of our knowledge, this work is the first
to design a high-throughput comparison/addition-friendly
processing-in-MRAM architecture for de Bruijn graph-based
genome assembly. We develop PANDA based on a set of
innovative microarchitectural and circuit-level schemes to
realize a data-parallel computational core for genome assembly;
(2) We reconstruct the existing genome assembly algorithm
in a step-by-step fashion so that it can be fully implemented in PIM
platforms. It supports short read analysis, graph construction,
and traversal;
(3) We propose a dense data mapping and
partitioning scheme to process the indices locally and handle
DNA sequences of various lengths;
(4) We extensively assess
and compare PANDA’s performance, energy efficiency, and
memory bottleneck ratio with a CPU and recent potential PIM
platforms.
II. PANDA PLATFORM
A. SOT-MRAM
Fig. 2a shows a Spin-Orbit Torque Magnetic Random
Access Memory (SOT-MRAM) device structure. The storage
element in SOT-MRAM is SHE-MTJ [29], [30], a composite
device structure of a Spin Hall Metal (SHM) and Magnetic
Tunnel Junction (MTJ). The binary data is stored as resis-
tance states of MTJ. Data-‘0’(/‘1’) is encoded as the MTJ’s
lower(/higher) resistance or parallel(/anti-parallel) magnetiza-
tion in both magnetic layers (free and fixed layers). Here
[Figure 2 graphics: (a) SHE-MTJ stack (pinned layer, tunneling barrier, free layer, heavy metal/SHM) with write current (IWRITE) and read current (IREAD) paths; the parallel state (P, RP) encodes ‘0’ and the anti-parallel state (AP, RAP) encodes ‘1’; (b) 2T1R bit-cell connected to WWL, WBL, RWL, RBL, and SL; (c) biasing conditions, tabulated below.]
Operations   Write ‘1’ (‘0’)   Read
WWL          VDD               0
RWL          0                 VDD
RBL          0                 IREAD
WBL          VWP (VWN)         0
SL           0                 0
Figure 2. (a) SOT-MRAM device structure and Spin Hall Effect, (b)
Schematic and (c) biasing conditions of SOT-MRAM bit-cell.
the flow of charge current (±y) through the SHM (Tungsten,
β-W [31]) causes accumulation of oppositely directed
spins on both surfaces of the SHM due to the spin Hall effect [29].
Thus, a spin current flowing in ±z is generated and further
produces spin-orbit torque (SOT) on the adjacent free magnetic
layer, switching its magnetization. Each cell located in the
computational sub-array is connected to a Write Word Line
(WWL), Write Bit Line (WBL), Read Word Line (RWL), Read
Bit Line (RBL), and Source Line (SL). The bit-cell structure
of 2T1R SOT-MRAM and its biasing conditions are shown in
Fig. 2b and 2c, respectively. In this work, the magnetization
dynamics of the Free Layer (m) are modeled by the LLG equation
with spin-transfer torque terms, which can be mathematically
described as [29]:
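$$\frac{d\mathbf{m}}{dt} = -|\gamma|\,\mathbf{m}\times\mathbf{H}_{eff} + \alpha\left(\mathbf{m}\times\frac{d\mathbf{m}}{dt}\right) + |\gamma|\beta\,(\mathbf{m}\times\mathbf{m}_p\times\mathbf{m}) - |\gamma|\beta'\,(\mathbf{m}\times\mathbf{m}_p) \tag{1}$$
$$\beta = \left|\frac{\hbar}{2\mu_0 e}\right| \frac{I_c P}{A_{MTJ}\, t_{FL}\, M_s} \tag{2}$$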
where ħ is the reduced Planck constant, γ is the gyromagnetic
ratio, α is the Gilbert damping factor, Ic is the charge current
flowing through the MTJ, tFL is the thickness of the free layer,
Ms is the saturation magnetization, β′ is the second spin-transfer
torque coefficient, Heff is the effective magnetic field, P is
the effective polarization factor, AMTJ is the cross-sectional
area of the MTJ, and mp is the unit polarization direction. Note that
the ferromagnets in the MTJ have In-plane Magnetic Anisotropy
(IMA) along the x-axis [29]. With the given thickness (1.2nm) of the
tunneling layer (MgO), the Tunnel Magneto-Resistance (TMR)
of the MTJ is ∼ 171.2%.
B. Architecture Design
We develop the PANDA platform based on a typical SOT-MRAM
hierarchy. Each memory chip consists of multiple memory
banks divided into 2D sub-arrays of SOT-MRAM cells, as
shown in Fig. 3a. We then modify the sub-array level
to make it reconfigurable, supporting both memory
operation and in-memory bit-line computation. As depicted
[Figure 3 graphics: (a) PANDA chip organization with banks, computational sub-arrays (C-Sub.), decoders, drivers, row buffer, shared DPUs, and ctrl, alongside the three assembly stages (Hashmap, DeBruijn, Traverse); (b) computational sub-array with modified row decoder, column decoder, write driver, and cells M1-M3 on a shared bit-line; (c) reconfigurable SA with Res-box references (RM, RAND3, RMAJ, ROR3) selected by CM, CAND3, CMAJ, and COR3, and an Add-box producing Sum and Carry.]
Figure 3. PANDA platform: (a) Memory organization, (b) Computational
sub-array, (c) The new reconfigurable sense amplifier designed to implement
a full-set of 2- and 3-input logic operations.
in Fig. 3b, the computational memory sub-array (C-Sub.) of
PANDA consists of a modified memory row decoder, column
decoder, write driver, and reconfigurable Sense Amplifier
(SA). The data-parallel intra-sub-array computation
is timed and controlled by a Controller (ctrl) with respect to
the physical address of operands.
PANDA is especially designed to support bulk bit-wise
operations between operands stored in each BL. Therefore,
the in-memory computational throughput is solely limited
by the physical memory row size i.e. 4KB/8KB in modern
main memory chips. Digital Processing Units (DPU) are also
shared between computational sub-arrays to handle nonparallel
computational load of the platform. In the following, we
explain different elements and the supported functions by
PANDA.
C. PIM Operations
Write Operation: To write ‘0’ (/‘1’) in a cell, e.g. the
cell in the 1st column and 2nd row (M2 in Fig. 3b), the associated
write driver first pulls WBL1 to the negative (/positive) write
voltage. This provides a preset charge current flow from
−Vwr to GND (/+Vwr to GND) that eventually changes the
cell’s resistance to low, RP (/high, RAP).
Reference Selection and Bit-line Computing: PANDA
leverages the reference selection and bit-line computing
Table I
CONTROL BITS FOR RECONFIGURABLE SA.
Operations          CAND3   CMAJ   COR3   CM   Active SA     Row Init.
Read                0       0      0      1    SA-III        No
(N)AND3/(N)AND2     1       0      0      0    SA-III        No/Yes
(N)OR3/(N)OR2       0       0      1      0    SA-I          No/Yes
X(N)OR2             1       1      1      0    SA-I-II-III   Yes
Maj (Carry)/Min     0       1      0      0    SA-II         No
XOR3 (Sum)          1       1      1      0    SA-I-II-III   No
method on top of a novel reconfigurable SA design shown in
Fig. 3c to handle memory read and in-memory computation.
The main idea of reference selection is to simultaneously
compare the resistance state of selected SOT-MRAM cell(s)
with one or multiple reference resistors in SA(s) to generate
the results. PANDA’s SA consists of three sub-SAs with a total
of four reference resistors. The ctrl unit could pick the proper
reference using enable control bits (CAND3, CMAJ , COR3,
CM ) to realize the memory read and a full-set of 2- and 3-
input logic functions, as tabulated in the Table I. We designed
and tuned the sense circuit based on StrongARM latch [32]
shown in Fig. 3c. Each read/in-memory computing operation
requires two clock phases: pre-charge (Clk ‘high’) and sensing
(Clk ‘low’). For instance, to realize the read operation, the
memory row decoder first activates the corresponding RWL,
then a small sense current (Isense) flows from the selected
cell to ground, and generates a sense voltage (Vsense) at
the input of SA-III. This voltage is accordingly compared
with the memory mode reference voltage activated by CM
(Vsense,P<Vref,M<Vsense,AP), as shown in Fig. 4a. The SA-III
produces high (/low) voltage if the path resistance is higher
(/lower) than RM (memory reference resistance), i.e. RAP
(/RP ). PANDA could implement one-threshold in-memory
operations ((N)AND, (N)OR, etc.) by activating multiple
RWLs simultaneously and enabling only one sub-SA
at a time; e.g., by setting CAND3 to ‘1’, 3-input AND/NAND
logic can be readily implemented between operands located
on the same bit-line. To implement 2-input logic, two rows
initialized to ‘0’/‘1’ are reserved in every sub-array so
that 2-input functions can be built from the 3-input functions.
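The reference-selection idea can be illustrated with a small numerical sketch. It assumes the cell resistances reported in the Functionality paragraph of Section II-D (Rlow = 5.6 kΩ, Rhigh = 15.17 kΩ), ignores the access transistors, and places every reference at the midpoint between the two neighbouring equivalent resistances. The midpoint placement is stated in the text only for RMAJ; using it for ROR3 and RAND3 as well, and all function names, are illustrative assumptions.

```python
from itertools import product

R_P = 5.6e3     # parallel state, encodes '0' (value from Section II-D)
R_AP = 15.17e3  # anti-parallel state, encodes '1' (value from Section II-D)

def equivalent(bits):
    """Equivalent resistance of the selected cells connected in parallel."""
    return 1.0 / sum(1.0 / (R_AP if b else R_P) for b in bits)

def midpoint(bits_low, bits_high):
    """Reference resistance halfway between two neighbouring cell states."""
    return 0.5 * (equivalent(bits_low) + equivalent(bits_high))

# Assumed reference placement (only R_MAJ is specified this way in the text).
R_OR3 = midpoint((0, 0, 0), (0, 0, 1))
R_MAJ = midpoint((0, 0, 1), (0, 1, 1))
R_AND3 = midpoint((0, 1, 1), (1, 1, 1))

def sense(bits, reference):
    """A sub-SA outputs '1' when the path resistance exceeds its reference."""
    return int(equivalent(bits) > reference)

for bits in product((0, 1), repeat=3):
    print(bits, sense(bits, R_OR3), sense(bits, R_MAJ), sense(bits, R_AND3))
```

With a single threshold per sub-SA, the three outputs reproduce OR3, MAJ, and AND3 of the stored bits, which is why one activation of three RWLs is enough to obtain all three functions.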
Addition: PANDA’s SA is enhanced with a unique circuit
design that allows an efficient single-cycle implementation of the
addition/subtraction (add/sub) operation. By
activating three memory rows at the same time (RWL1,
RWL2, and RWL3 in Fig. 3b), the OR3, Majority (MAJ), and
AND3 functions can be readily realized through SA-I, SA-II,
and SA-III, respectively. Each SA compares the equivalent
resistance of the parallel-connected input cells and their cascaded
access transistors with a programmable reference
(ROR3/RMAJ/RAND3). The idea of voltage comparison
between Vsense and Vref to realize these functions is depicted
in Fig. 4a. While there are several addition-in-memory designs
in the non-volatile memory domain, they typically apply
large circuitry after the SA and realize a multi-cycle design. To
implement a single-cycle addition operation, we
reformulate the full-adder Boolean expression to make it PIM-friendly.
We notice that when the majority function of the three inputs is
0, the Sum can be implemented by the OR3 function, and when the
majority function is 1, the Sum can be obtained through the AND3
function. This behavior can be implemented by the multiplexer
circuit shown in the Add-box in Fig. 3c. The Boolean logic of
such an in-memory addition function is written as:
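$$\mathrm{Carry} = AB + AC + BC = \mathrm{MAJ}(A,B,C) \tag{3}$$
$$\begin{aligned}
\mathrm{Sum} &= \left(\overline{(AB + AC + BC)}\cdot(A + B + C)\right) + \left((AB + AC + BC)\cdot(ABC)\right)\\
&= \overline{\mathrm{MAJ}(A,B,C)}\cdot\mathrm{OR}(A,B,C) + \mathrm{MAJ}(A,B,C)\cdot\mathrm{AND}(A,B,C)\\
&= \overline{\mathrm{Carry}}\cdot\mathrm{OR}(A,B,C) + \mathrm{Carry}\cdot\mathrm{AND}(A,B,C)
\end{aligned} \tag{4}$$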
The carry-out of the full-adder can be directly produced
by the MAJ function (Carry in Fig. 3c) just by setting CMAJ
to ‘1’ in a single memory cycle. For the MAJ operation, RMAJ
is set at the midpoint of RP //RP //RAP (‘0’,‘0’,‘1’) and
RP //RAP //RAP (‘0’,‘1’,‘1’), as depicted in Fig. 4a. Now,
given operands M1, M2, and M3 (Fig. 3b), PANDA can
generate the Carry (MAJ) and Sum (XOR3) in-memory logic in a
single memory cycle. The ctrl’s configuration for such an add
operation is tabulated in Table I.
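As a software-level check of the reformulated full adder, the following minimal sketch models each memory row as a Python integer whose bits stand for the cells on different bit-lines. The function names and the bit-serial wrapper are illustrative assumptions, not part of the PANDA design; only the Carry = MAJ and Sum = mux(Carry, OR3, AND3) relations come from the text.

```python
def maj3(a, b, c):
    """Bit-wise majority of three rows (the Carry output of SA-II)."""
    return (a & b) | (a & c) | (b & c)

def full_adder(a, b, c, mask=1):
    """Sum selected by Carry: OR3 when Carry is 0, AND3 when Carry is 1."""
    carry = maj3(a, b, c)
    or3 = a | b | c
    and3 = a & b & c
    total = ((~carry & or3) | (carry & and3)) & mask
    return total, carry

def bit_serial_add(x, y, nbits):
    """Assumed usage: add two nbits-wide numbers one bit position per step."""
    carry, result = 0, 0
    for i in range(nbits):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result

# Bulk bit-wise use: every bit position is an independent full adder.
print(full_adder(0b1100, 0b1010, 0b0110, mask=0b1111))
print(bit_serial_add(13, 7, 8))
```

Because the multiplexed expression equals XOR3 for every input combination, one simultaneous activation of three rows yields both adder outputs.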
Comparison: The PANDA platform offers a single-cycle
implementation of the XOR3 in-memory logic (Sum). To realize the bulk
bit-wise comparison operation based on XNOR2, one memory
row in each PANDA sub-array is initialized to ‘1’. In this way,
XNOR2 can be readily implemented out of the XOR3 function,
since XOR3(A, B, 1) is the complement of XOR2(A, B).
Therefore, every memory sub-array can potentially perform a
parallel comparison operation without the need for external add-on
logic or multi-cycle operations.
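The XNOR2-from-XOR3 construction can be sketched in a few lines. Rows are again modeled as Python integers; the helper names and the final equality check (standing in for the AND reduction of the per-bit results) are illustrative assumptions.

```python
def xor3(a, b, c):
    """Bit-wise three-input XOR, the Sum output of the reconfigurable SA."""
    return a ^ b ^ c

def xnor2_via_xor3(a, b, width):
    """XNOR2 obtained by feeding an all-'1' initialized row as third operand."""
    ones = (1 << width) - 1  # models the memory row initialized to '1'
    return xor3(a, b, ones) & ones

def rows_match(a, b, width):
    """Two rows are equal when every XNOR2 output bit is '1'."""
    return xnor2_via_xor3(a, b, width) == (1 << width) - 1

print(bin(xnor2_via_xor3(0b1100, 0b1010, 4)))
print(rows_match(0b1011, 0b1011, 4), rows_match(0b1011, 0b1001, 4))
```

A '1' in the XNOR2 result marks a bit position where the two operands agree, so a full-row match corresponds to an all-ones result.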
[Figure 4 graphics: (a) Vsense versus Vref reference comparison for Read, AND2/OR2, and MAJ/AND3/OR3; (b) Monte Carlo distributions of Vsense (mV) for one, two, and three selected cells, with annotated sense margins of 43.31 mV, 14.62 mV, 5.82 mV, and 4.28 mV.]
Figure 4. (a) Reference comparison to realize in-memory operations, (b)
Monte-Carlo simulation of Vsense.
D. Performance Analysis
Functionality: To verify the circuit functionality of
PANDA’s sub-array, we first model SOT-MRAM cell by jointly
applying the Non-Equilibrium Green’s Function (NEGF) and
Landau-Lifshitz-Gilbert (LLG) with spin Hall effect equations
[29], [5]. We then develop a Verilog-A model of 2-transistor 1-
resistor SOT-MRAM device with parameters listed in Table II
to co-simulate with other peripheral CMOS circuits displayed
in Fig. 3 in Cadence Spectre and SPICE. We use 45nm
North Carolina State University (NCSU) Product Development
Kit (PDK) library [33] for our circuit analysis. The transient
simulation result of a single 256×256 sub-array is shown in
Fig. 5. We take M1, M2, and M3 as three SOT-MRAM cells
located in the first column as the inputs for our evaluation.
Table II
DEVICE PARAMETERS
Parameter                                Value
Free layer dimension (W × L × t)FL       60 × 40 × 2 nm³
SHM dimension                            60 × 80 × 2 nm³
Demagnetization factor, Dx; Dy; Dz       0.066; 0.911; 0.022
Spin flip length, λsh                    1.4 nm
Spin Hall angle, θsh                     0.3
Gilbert damping factor, α                0.007
Saturation magnetization, Ms             850 kA/m
Oxide thickness, tox                     1.2 nm
RA product, RAp / TMR                    10.58 Ω·µm² / 171.2%
Supply voltage                           1 V
CMOS technology                          45 nm
SOT-MRAM cell area                       69 F²
Access transistor width                  4.5F
Cell aspect ratio                        1.91
[Figure 5 graphics: transient waveforms versus time (ns) of Clk, RWL, the write current, Vsense against VRef-OR and VRef-AND, and the SA-I (OR3), SA-II (Carry), SA-III (AND3), and Sum outputs for input combinations M1M2M3 = 000, 100, 110, and 111.]
Figure 5. Transient simulation wave-forms of PANDA’s sub-array and its
reconfigurable SA for performing single-cycle in-memory operations.
Here, we consider four input combination scenarios for the
write operation, as indicated by 000, 100, 110, and 111 in
Fig. 5. For the sake of clarity of wave-forms, we assume a
3ns period clock synchronises the write and read operation.
However, a 2ns period can be used for a reliable read and
in-memory computation.
During the precharge phase of the SA (Clk=1), a ±Vwrite voltage
is applied to the WBL to change the MRAM cell resistance to
Rlow=5.6 kΩ or Rhigh=15.17 kΩ. Prior to the evaluation phase
(Eval.) of the SA, WWL and WBL are grounded while RBL is fed
by the very small sense current, Isense=3 µA. In the evaluation
phase, RWL goes high and, depending on the resistance state of the
parallel bit-cells and accordingly SL, Vsense is generated at the
first input of the SAs, while Vref is generated at the second input
of the SAs. The voltage comparison between Vsense and Vref for
AND3 and OR3 and the outputs of the SAs are plotted in Fig. 5. For
example, we observe that only when Vsense>Vref,AND (M1M2M3=
111) does SA-III output binary ‘1’; otherwise its output is ‘0’. Fig. 5
also shows the in-memory XOR3 function (Sum) accomplished
in a single memory cycle through the three SA outputs.
Reliability: We assess the variation tolerance of the proposed
sub-array and SA circuit by running a rigorous Monte
Carlo simulation. We run the simulation for 10000 iterations
considering two sources of variation in SOT-MRAM cells: first, a
σ = 5% process variation on the Tunneling MagnetoResistance
(TMR), and second, a σ = 2% variation on the Resistance-Area
product (RAP). The results illustrated in Fig. 4b show that
the sense margin reduces as the number of selected
input cells for in-memory operations increases. This can be alleviated by
increasing the oxide thickness tox of the SHE-MTJ, as thoroughly
discussed in [34]. Accordingly, tox was increased from
1.5nm to 2nm. This increased the sense margin by ∼45mV,
which considerably enhances the reliability.
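The shrinking sense margin can be reproduced qualitatively with a simplified Monte Carlo sketch. It is not the circuit-level simulation used in this work: it ignores access transistors and the SA, models the RA-product variation as a variation of RP, takes Vsense = Isense × Req, and defines the margin as the worst-case gap between neighbouring state distributions. All of these modelling choices and the function names are assumptions, so the absolute numbers will not match Fig. 4b.

```python
import random

R_P_NOM = 5.6e3   # nominal parallel resistance (Section II-D)
TMR_NOM = 1.712   # nominal TMR of 171.2% (Table II)
I_SENSE = 3e-6    # sense current of 3 uA (Section II-D)

def sample_cell(bit, rng):
    """Draw one cell resistance with 2% RA and 5% TMR Gaussian variation."""
    r_p = rng.gauss(R_P_NOM, 0.02 * R_P_NOM)
    tmr = rng.gauss(TMR_NOM, 0.05 * TMR_NOM)
    return r_p * (1.0 + tmr) if bit else r_p

def v_sense(bits, rng):
    """Sense voltage across the parallel combination of the selected cells."""
    conductance = sum(1.0 / sample_cell(b, rng) for b in bits)
    return I_SENSE / conductance

def worst_case_margin(n_cells, trials=10000, seed=0):
    """Smallest gap between adjacent states (k and k+1 cells in AP)."""
    rng = random.Random(seed)
    margins = []
    for n_ap in range(n_cells):
        low_bits = [1] * n_ap + [0] * (n_cells - n_ap)
        high_bits = [1] * (n_ap + 1) + [0] * (n_cells - n_ap - 1)
        low = [v_sense(low_bits, rng) for _ in range(trials)]
        high = [v_sense(high_bits, rng) for _ in range(trials)]
        margins.append(min(high) - max(low))
    return min(margins)

for n in (1, 2, 3):
    print(n, "cell(s):", round(worst_case_margin(n, trials=2000) * 1e3, 2), "mV")
```

Under these assumptions the margin drops as more cells share the bit-line, because the equivalent resistances of neighbouring states move closer together while the per-cell variation stays the same.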
Sub-array level Performance: To explore the hardware
overhead of PANDA on top of a standard unmodified SOT-MRAM
platform, we perform an iso-capacity performance
comparison. We develop both platforms with a sample 32Mb
single-bank, 512-bit data width configuration in the NVSim memory
evaluation tool. The circuit-level data is adopted from our circuit-level
simulation and then fed into an NVSim-compatible PIM
library to report the results. Table III lists the performance
measures for dynamic energy, latency, leakage power, and
area. We observe a ∼30% increase in area to
support the proposed in-memory computing functions for
genome assembly. As for dynamic energy, PANDA shows
an increase in R (Read) energy in spite of the power gating
mechanism used in the reconfigurable SA to turn off
non-selected SAs (SA-I and SA-II during a read operation). In this
way, C-Add (C stands for Computation) requires ∼2.4× more
power compared with a single-SA read operation. However,
Table III shows PANDA is able to offer a close-to-read latency
for C-AND3 and C-Add compared with the standard design.
There is also an increase in leakage power coming
from the add-on CMOS circuitry.
Table III
PERFORMANCE COMPARISON BETWEEN A STANDARD SOT-MRAM CHIP AND PANDA.
Designs    Area (mm²)   Dynamic energy (nJ)           Latency (ns)                  Leak. power (mW)
                        R     W     C-AND3  C-Add     R     W     C-AND3  C-Add
Standard   7.06         0.57  0.66  -       -         3.85  4.5   -       -         402
PANDA      9.3          0.78  0.69  0.85    1.93      3.91  4.59  3.91    3.91      586
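The overhead figures quoted in the text can be recomputed directly from the Table III entries. The short sketch below does only that arithmetic; the dictionary keys and the helper name are illustrative.

```python
# Values copied from Table III.
STANDARD = {"area_mm2": 7.06, "read_nj": 0.57, "leak_mw": 402.0}
PANDA = {"area_mm2": 9.3, "read_nj": 0.78, "c_add_nj": 1.93, "leak_mw": 586.0}

def overhead_percent(new, base):
    """Relative increase of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

area_overhead = overhead_percent(PANDA["area_mm2"], STANDARD["area_mm2"])
read_overhead = overhead_percent(PANDA["read_nj"], STANDARD["read_nj"])
leak_overhead = overhead_percent(PANDA["leak_mw"], STANDARD["leak_mw"])
c_add_vs_read = PANDA["c_add_nj"] / PANDA["read_nj"]

print(f"area +{area_overhead:.1f}%, read energy +{read_overhead:.1f}%, "
      f"leakage +{leak_overhead:.1f}%, C-Add/read = {c_add_vs_read:.2f}x")
```

The area ratio comes out slightly above 30%, and the C-Add to read energy ratio is close to the ∼2.4× stated above.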
E. Software Support
PANDA is designed to be an efficient and independent
accelerator for DNA assembly; nevertheless, it needs to be exposed
to programmers and system-level libraries. PANDA
could be connected directly to the memory bus or through
PCI-Express lanes as a third-party accelerator. Thus, it could
be integrated similarly to GPUs. An ISA and a virtual
machine for parallel and general-purpose thread execution
therefore need to be developed, like NVIDIA’s PTX [35]. With that,
at install time, programs are translated to the PANDA ISA
discussed here to implement the in-memory functions listed
in Table I. We introduce the PANDA_Mem_insert (des, src, size)
instruction to read source data from the memory and write
it back to a destination memory location consecutively. The
size of input vectors for in-memory computation could be at
most a multiple of PANDA’s sub-array row size. PANDA_Cmp
(src1, src2, size) performs a parallel bulk bit-wise comparison
between source vectors 1 and 2. PANDA_Add (src1,
src2, size) runs element-wise addition between cells located
in the same column, as will be explained in the next section.
III. PANDA ALGORITHM AND MAPPING
The genome assembly algorithm consists of three main
stages, visualized in Fig. 6: first, creating a hash table out
of chopped short reads (k-mers) and keeping a count of each
distinct k-mer; second, constructing a de Bruijn graph with the
Hashmap; third, traversing the de Bruijn graph for an Euler
path1. There is a final stage, called scaffolding, that closes the gaps
between contigs, which are the result of the de novo assembly
[2].
1 Stages II and III are the so-called contig generation.
Figure 6. The genome assembly stages.
The first three stages always take the largest fraction of execution
time and computational resources (over 80%) in both CPU
and GPU implementations [2]. To effectively handle the huge
number of short reads, we modularize the assembly algorithm,
focusing on parallelizing the main steps by loading only
the necessary data at each stage into the PANDA platform, and
leave stage 4 as future work.
A. Stage One: Hash Table
Algorithm 1 shows the reconstructed Hashmap(S,k) procedure,
in which the algorithm takes a k-mer from the original
sequence (S) in each iteration, creates a hash table entry (key)
for it, and sets its frequency (value) to 1. This step is
visualized in Fig. 7. If the k-mer is already in the table, it
calculates a new frequency (New_frq) by incrementing the previous
frequency by one and updates the value. As indicated, the Hashmap
procedure can be implemented through the PANDA_Cmp
(comparison), PANDA_Add (addition), and PANDA_Mem_insert
(memory W/R) in-memory operations. Such functions are
used iteratively in every step of the for loop, and PANDA is specially
designed to handle such a computation-intensive load by
performing comparison, summing, and copying operations.
Figure 7. The hash table generation out of k-mers.
Considering that the number of distinct keys
in the hash table is comparable to the genome size G,
the memory space required to store the hash table is given by
∼ 2×G×(k+1) bits (the factor of 2 accounts for the
2 bits per nucleotide). For instance, storing the hash table for the
human genome with G ∼3×10⁹ and k=32 requires ∼23GB,
mostly associated with storing the keys. Due to the very large
memory space requirement of the hash table for the assembly-in-memory
algorithm [2], we partition these tables into multiple
sub-arrays to fully leverage PANDA’s parallelism and to maximize
computation throughput. Obviously, larger memory units [36]
and distributed memory schemes [2], [37] are preferable.
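The quoted footprint follows from the 2×G×(k+1) estimate by direct arithmetic. The sketch below evaluates it for the human-genome example; the function name is illustrative, and interpreting the result in binary gigabytes (2³⁰ bytes) is an assumption about the unit used in the text.

```python
def hash_table_bits(genome_size, k):
    """Estimated hash table size in bits: 2 bits per nucleotide, k+1 per entry."""
    return 2 * genome_size * (k + 1)

bits = hash_table_bits(3 * 10**9, 32)
gib = bits / 8 / 2**30   # bytes converted to binary gigabytes
gb = bits / 8 / 10**9    # bytes converted to decimal gigabytes

print(f"{bits:.3e} bits = {gib:.2f} GiB = {gb:.2f} GB")
```

The result is roughly 23 binary gigabytes (about 24.75 decimal gigabytes), which agrees with the ∼23GB stated above.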
Algorithm 1 Procedure Hashmap(S, k)
Step-1. Initialization:
hashtable named Hashmap = {}
Step-2. Fill out the table:
for i := 0 to length(S)-k+1 do
    k_mer ← S[i : i+k]   ▷ copy values of S[i to i+k] into variable k_mer
    if PANDA_Cmp(k_mer, Hashmap) == 0 then
        PANDA_Mem_insert(k_mer, 1)
    else
        New_frq ← PANDA_Add(k_mer, 1)   ▷ increment frq by 1
        PANDA_Mem_insert(k_mer, New_frq)   ▷ insert into Hashmap again
    end if
end for
return Hashmap
The proposed correlated partitioning and mapping methodology,
as shown in Fig. 8a, locally stores correlated regions
of k-mer (980 rows) vectors, where each row stores up to
128 bps (A, C, G, T encoded by 2 bits), and value (32 rows)
vectors in the same sub-array. For counting the frequency
of each distinct k-mer, the ctrl first reads and parses the short
reads from the original sequence bank to the specific sub-array.
As depicted in Fig. 8a, assuming S=CGTGTGCA as the short
read, the k-mers ki to ki+n are extracted and written into
consecutive memory rows of the k-mer region. However, when a
new query such as ki+3 arrives (while ki to ki+2 are already in
the memory), it is first written to the temp region. A
parallel in-memory comparison operation (PANDA_Cmp) is then
performed between the temp data and the already-stored k-mers.
Fig. 8b intuitively shows the PANDA_Cmp procedure, where the entire
temp row can be compared with a previous k-mer row in a
single cycle. Then, a built-in AND unit in the DPU
takes all the results to determine the next memory operation
according to the algorithm. To increase the frequency of
a specific k-mer, PANDA_Add is leveraged to perform
in-memory addition without sending data to an off-chip processor.
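A software model of the Hashmap procedure helps to see how the three in-memory operations interact. The sketch below is a functional stand-in only: the Python functions `panda_cmp` and `panda_add`, the list-based key and value regions, and the 2-bit code A=00, C=01, G=10, T=11 are illustrative assumptions rather than the PANDA ISA or its actual encoding.

```python
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed 2-bit code
BASES = "ACGT"

def encode(kmer):
    """Pack a k-mer into an integer, 2 bits per nucleotide, first base first."""
    word = 0
    for base in kmer:
        word = (word << 2) | ENCODING[base]
    return word

def decode(word, k):
    """Unpack a 2k-bit integer back into its k-mer string."""
    return "".join(BASES[(word >> (2 * (k - 1 - i))) & 0b11] for i in range(k))

def panda_cmp(temp_row, stored_rows, width):
    """Model of the comparison: XNOR the temp row with each stored row, then
    AND-reduce the result (as the DPU does). Returns the matching row index."""
    mask = (1 << width) - 1
    for index, row in enumerate(stored_rows):
        if (~(temp_row ^ row) & mask) == mask:
            return index
    return None

def panda_add(value, increment=1):
    """Model of the in-memory addition used to bump a frequency counter."""
    return value + increment

def count_kmers(S, k):
    """Functional equivalent of Algorithm 1 on the modeled key/value regions."""
    keys, values = [], []
    for i in range(len(S) - k + 1):
        temp = encode(S[i:i + k])            # new query written to temp
        hit = panda_cmp(temp, keys, 2 * k)
        if hit is None:
            keys.append(temp)                # insert new key with frequency 1
            values.append(1)
        else:
            values[hit] = panda_add(values[hit])
    return {decode(key, k): value for key, value in zip(keys, values)}

print(count_kmers("CGTGTGCA", 3))
```

For the short read S=CGTGTGCA with k=3, the k-mer GTG appears twice, so its second occurrence takes the comparison-hit path and increments the stored value instead of inserting a new row.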
[Figure 8 graphic omitted. Recoverable labels: original sequence bank; hash table; sub-array organization with temp, k-mer (key), value, and reserved regions; base encoding T = 00, G = 01, A = 10, C = 11.]
Figure 8. (a) The proposed correlated data partitioning and mapping
methodology for creating the hash table, (b) realization of the parallel
in-memory comparator (PANDA_Cmp) between k-mers in a computational
sub-array.
B. Stage Two: Graph Construction
The next step is to construct and access a de Bruijn graph based on the
hash table, which allows rapid lookup of the value associated with each
k-mer. For each entry (of length k) in the Hashmap, we make two nodes,
one with the prefix of length k-1 and the other with the suffix of
length k-1 (e.g., CGTGC → CGTG and GTGC), and connect an edge between
them. For each hash table entry with n as the frequency, n edges are
then added between the two nodes. The de Bruijn graph G for the sample
hash table in Fig. 7 is constructed in Fig. 9 (step 1). Algorithm 2
shows the reconstructed de Bruijn procedure for PANDA, which takes the
Hashmap data and k as input and returns matrix G. For each key within
the hash table, the PANDA_Mem_insert instruction creates an entry in G
for node_1 and node_2. Leveraging an adjacency matrix representation
for direct mapping of such a huge sparse graph into memory comes at the
cost of significantly increased memory requirement and run time. The
size of the adjacency matrix is V×V for any graph with V nodes, whereas
the sparse matrix can be represented by a 3×E matrix, where E is the
number of distinct edges in the graph (parallel edges are folded into
the count stored in the third row). PANDA utilizes the sparse matrix
representation shown in Fig. 9 (step 2) for mapping purposes. Each
entry in the 3rd row of the sparse matrix represents the number of
connections between the two nodes in the 1st and 2nd rows.
[Figure 9 graphic omitted. Recoverable labels: hash table (k_mer, value); de Bruijn graph G with vertices v1 to v6 (vertices are (k-1)-mers, edges are k-mers); memory-intensive adjacency matrix G; sparse matrix G with Src, Dst, and #E rows; mapping and allocation to PANDA chips 0 to M-1; parallel compute in sub-arrays.]
Figure 9. Graph construction using the sparse matrix representation,
with partitioning, allocation, and parallel computation.
Algorithm 2 Procedure DeBruijn(Hashmap, k)
Step-1. Initialization:
  G ← [], Nodes_List ← [], i ← 1
Step-2. Sparse Graph Construction:
  for all k_mer ∈ Hashmap.keys() do
    node_1 ← k_mer[0 : k − 2]                 ▷ prefix of length k − 1 (inclusive indices)
    node_2 ← k_mer[1 : k − 1]                 ▷ suffix of length k − 1 (inclusive indices)
    PANDA_Mem_insert(G[1][i], node_1)
    PANDA_Mem_insert(G[2][i], node_2)
    PANDA_Mem_insert(G[3][i], Hashmap[k_mer])
    i ← i + 1
  end for
  return G
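To make the sparse 3×E layout concrete, the construction in Algorithm 2 can be sketched in a few lines of Python. This is an illustrative software model under stated assumptions: `de_bruijn_sparse` and `storage_cells` are hypothetical helper names, the three rows are plain Python lists rather than memory rows, and each list append stands in for one PANDA_Mem_insert.

```python
def de_bruijn_sparse(hashmap: dict, k: int):
    """Build the three rows (source, destination, edge count) of the sparse matrix G."""
    src, dst, cnt = [], [], []
    for k_mer, freq in hashmap.items():
        if len(k_mer) != k:
            raise ValueError(f"{k_mer!r} is not a {k}-mer")
        src.append(k_mer[:k - 1])  # node_1: prefix of length k-1
        dst.append(k_mer[1:])      # node_2: suffix of length k-1
        cnt.append(freq)           # n parallel edges stored as one count
    return src, dst, cnt


def storage_cells(src, dst):
    """Compare the entry counts of the V x V adjacency matrix and the 3 x E sparse matrix."""
    V = len(set(src) | set(dst))
    E = len(src)
    return V * V, 3 * E
```

For the six-entry sample hash table, the sketch produces six columns over six distinct (k-1)-mers, so the adjacency matrix would need 36 entries against 18 for the sparse form. The gap widens quickly for real graphs, where the number of vertices is large and each vertex has only a few neighbours.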
To balance the workload of each PANDA chip and maximize parallelism, we
leverage the interval-block partitioning method. We use a hash-based
approach [38], splitting the vertices into M intervals and then
dividing the edges into $M^2$ blocks, as shown in Fig. 9 (step 3:
mapping). Each block is then allocated to a chip (step 4: allocation)
and mapped to its sub-arrays. Given an N-vertex sub-graph with $N_s$
activated sub-arrays (size = x × y), each sub-array can process n
vertices, where $n \le f$, $n \in \mathbb{N}$, and $f = \min(x, y)$
(step 5: parallel computation). In this way, the number of processing
sub-arrays for an N-vertex sub-graph can be formulated as
$$N_s = \left\lceil \frac{N}{f} \right\rceil.$$
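The partitioning step and the sub-array count can be illustrated with a short Python sketch. Several details here are assumptions rather than facts from the text: vertices are represented by integer ids, the hash that assigns a vertex to an interval is a simple modulo, and `interval_of`, `partition_edges`, and `num_subarrays` are hypothetical names.

```python
import math


def interval_of(vertex_id: int, M: int) -> int:
    """Assign a vertex to one of M intervals (modulo hashing is an assumption)."""
    return vertex_id % M


def partition_edges(edges, M: int) -> dict:
    """Group (src, dst, count) edges into an M x M grid of blocks."""
    blocks = {(r, c): [] for r in range(M) for c in range(M)}
    for src, dst, count in edges:
        blocks[(interval_of(src, M), interval_of(dst, M))].append((src, dst, count))
    return blocks


def num_subarrays(N: int, x: int, y: int) -> int:
    """N_s = ceil(N / f) with f = min(x, y) for an x-by-y sub-array."""
    f = min(x, y)
    return math.ceil(N / f)
```

With a 1024 × 256 sub-array, f = 256, so a hypothetical 1000-vertex sub-graph would occupy ceil(1000 / 256) = 4 sub-arrays. Each of the $M^2$ blocks returned by `partition_edges` is the unit that would be allocated to a chip.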
Algorithm 3 Procedure Find_Start_Vertex(G)
Step-1. Initialization:
  start ← first node, end ← 0
  edge_cnt ← 0                                ▷ for counting the number of edges in G
  Len ← size(G)                               ▷ number of columns in the sparse matrix G
Step-2. Find the start vertex:
  for n in Nodes do
    in_degree[n] ← 0
    out_degree[n] ← 0
  end for
  for n in Nodes do
    for k := 1 to Len do
      if PANDA_Cmp(G[1][k], n) then           ▷ node n has an out-going edge
        out_degree[n] ← PANDA_Add(out_degree[n], int(G[3][k]))
        in_degree[int(G[2][k])] ← PANDA_Add(in_degree[int(G[2][k])], int(G[3][k]))
        edge_cnt ← PANDA_Add(edge_cnt, int(G[3][k]))
      end if
    end for
  end for
  for n in Nodes do                           ▷ compare only after all degrees are final
    if PANDA_Cmp(out_degree[n], in_degree[n] + 1) then
      start ← n
    end if
  end for
  return start & edge_cnt & out_degree
After graph construction, it is possible to perform a round of
simplification on the sparse graph stored in PANDA, without loss of
information, to avoid fragmentation of the graph. As a matter of fact,
the blocks are broken up each time a short read starts or ends, leading
to linear connected subgraphs [22]. This fragmentation imposes longer
execution time and larger memory space. The simplification process
merges two nodes within memory if node-A has only one out-going edge,
directed to node-B, and node-B has only one in-going edge.
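The merge rule of the simplification step can be expressed as a small predicate over the sparse matrix. The Python sketch below only detects candidate pairs and does not perform the merge. It assumes that parallel edges count towards the degree (an edge with count 2 is two out-going edges), and `mergeable_pairs` is a hypothetical name.

```python
from collections import Counter


def mergeable_pairs(src, dst, cnt):
    """Return (A, B) pairs where A has exactly one out-going edge, which goes to B,
    and B has exactly one in-going edge."""
    out_deg, in_deg = Counter(), Counter()
    for a, b, c in zip(src, dst, cnt):
        out_deg[a] += c  # parallel edges are counted individually (assumption)
        in_deg[b] += c
    return [
        (a, b)
        for a, b in zip(src, dst)
        if a != b and out_deg[a] == 1 and in_deg[b] == 1
    ]
```

On the running example, the candidates are (TGCG, GCGT), (GCGT, CGTG), and (TGCT, GCTT). The edge CGTG → GTGC is not a candidate because it has multiplicity 2, and GTGC is excluded as a source because it branches to two different nodes.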
C. Stage Three: Traversal for Euler Path
The input of this stage is the sparse representation of graph G. To
traverse all the edges, we use Fleury's algorithm to find the Euler
path of the graph (a path that traverses every edge of the graph
exactly once). A directed graph has an Euler path if the in_degree and
out_degree (see footnote 2) of every vertex are the same, or if there
are exactly two vertices with |in_degree − out_degree| = 1: one with
out_degree = in_degree + 1, which is the start vertex, and one with
in_degree = out_degree + 1, which is the end vertex. Finding the
starting vertex is essential for generating the Eulerian path, since an
arbitrary vertex cannot be used as the starting vertex. The
reconstructed PIM-friendly algorithm for finding the start vertex in
graph G is shown in Algorithm 3. For each node, this stage relies on a
massive number of iteratively used PANDA_Add operations to calculate
in_degree, out_degree, and edge_cnt (the total number of edges).
Moreover, a parallel PANDA_Cmp operation is required to check the
condition out_degree = in_degree + 1.
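The degree bookkeeping behind the start-vertex search can be sketched as follows. This is a plain Python model under stated assumptions: `find_start_vertex` is a hypothetical name, the additions and the final comparison stand in for PANDA_Add and PANDA_Cmp, and the comparison is done only after all degrees have been accumulated.

```python
from collections import Counter


def find_start_vertex(src, dst, cnt):
    """Return (start, edge_cnt, out_degree) for a sparse graph given as three rows."""
    out_deg, in_deg = Counter(), Counter()
    edge_cnt = 0
    for a, b, c in zip(src, dst, cnt):
        out_deg[a] += c  # stands in for PANDA_Add on out_degree
        in_deg[b] += c   # stands in for PANDA_Add on in_degree
        edge_cnt += c    # stands in for PANDA_Add on edge_cnt
    start = src[0]  # fall back to the first node if all vertices are balanced
    for n in dict.fromkeys(list(src) + list(dst)):  # every vertex once, in order
        if out_deg[n] == in_deg[n] + 1:  # stands in for PANDA_Cmp
            start = n
    return start, edge_cnt, dict(out_deg)
```

For the running example, the out-degrees of v1 to v6 are 1, 1, 2, 2, 0, 1, the total edge count is 7, and the only vertex with out_degree = in_degree + 1 is CGTG (v3), which matches the start node marked in Fig. 10.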
After finding the start node, PANDA traverses the sparse matrix G from
the starting vertex, checks two conditions for each edge, and
accordingly adds qualified edges to the Eulerian path. We show the
reconstructed Fleury algorithm in Algorithm 4. If an edge is not a
bridge and is not the last edge of the graph, we add (start, v) to the
Eulerian path and remove that edge. The isValidNextEdge() function
checks whether the edge (u, v) is valid to be included in the Euler
path. If v is the only adjacent vertex remaining for u, all other
adjacent vertices have already been traversed, so we take this edge.
Otherwise, the second condition is checked: it counts the number of
nodes reachable from u before and after removing the edge. If the
number decreases, the edge is a bridge (removing it disconnects the
graph into two parts). If it is a bridge, we cannot remove the edge
from the graph; otherwise, we remove the edge and add it to the Euler
path.
2 The in_degree[i] shows how many edges are coming into vertex i, and
out_degree[i] shows how many out-going edges vertex i has.
Algorithm 4 Procedure Fleury(G, start, edge_cnt, out_degree)
  for k := 1 to Len do                                ▷ scan the columns of the sparse matrix G
    if G[1][k] == start and G[3][k] > 0 then          ▷ an unused edge leaves start
      v ← G[2][k]
      if isValidNextEdge(start, v) then
        PANDA_Mem_insert(v)                           ▷ add (start, v) to the Eulerian path
        PANDA_Add(out_degree[start], −1)
        PANDA_Add(G[3][k], −1)                        ▷ remove one edge from the graph
        PANDA_Add(edge_cnt, −1)
        Fleury(G, v, edge_cnt, out_degree)            ▷ run Fleury again for the next node v
      end if
    end if
  end for
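A software version of the traversal helps to check the procedure on the running example. The sketch below is a Fleury-style walk written in Python under stated assumptions and is not the paper's in-memory realization: all function names are hypothetical, and the bridge test is implemented as a directed reachability check (after tentatively removing an edge (u, v), every vertex that still has out-going edges must be reachable from v), which is one way to realise the "reachable nodes before and after removal" test for a directed multigraph.

```python
def _reachable(adj, start):
    """Vertices reachable from start over edges with remaining count > 0."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v, c in adj.get(u, {}).items():
            if c > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_valid_next_edge(adj, u, v):
    """Condition 1: v is the only neighbour left. Condition 2: (u, v) is not a bridge."""
    remaining = [w for w, c in adj[u].items() if c > 0]
    if remaining == [v]:
        return True
    adj[u][v] -= 1  # tentatively remove one copy of the edge
    seen = _reachable(adj, v)
    ok = all(w in seen for w in adj if any(c > 0 for c in adj[w].values()))
    adj[u][v] += 1  # restore it
    return ok


def fleury_path(src, dst, cnt, start):
    """Return the list of vertices on an Eulerian path that begins at start."""
    adj = {}
    for a, b, c in zip(src, dst, cnt):
        adj.setdefault(a, {})
        adj[a][b] = adj[a].get(b, 0) + c
    path, u, edges_left = [start], start, sum(cnt)
    while edges_left > 0:
        for v in adj.get(u, {}):
            if adj[u][v] > 0 and is_valid_next_edge(adj, u, v):
                adj[u][v] -= 1  # remove one edge from the graph
                edges_left -= 1
                path.append(v)
                u = v
                break
        else:
            raise ValueError("no Eulerian path from this start vertex")
    return path


def contig(path):
    """Spell the contig: first vertex plus the last base of every following vertex."""
    return path[0] + "".join(v[-1] for v in path[1:])
```

Starting from CGTG on the sample graph, the walk visits CGTG, GTGC, TGCG, GCGT, CGTG, GTGC, TGCT, GCTT, and `contig` spells CGTGCGTGCTT, the original sequence. The result does not depend on the column order of the sparse matrix, because the branch towards TGCT is rejected as a bridge until the cycle through TGCG has been consumed.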
In the interest of space, we only show the in_degree/out_degree and
edge_cnt mapping and computation in the PANDA platform in Fig. 10,
which sums all the entries of valid links connected to a particular
node i in order to find the start vertex. As can be seen, we use the
sparse matrix representation to store matrix G. In our mapping
technique, each column is assigned to a distinct source vertex in the
graph and is then filled, in a vertical fashion, with the number of
edges (#E) linked only to existing destination vertices. Therefore, we
do not assign destination vertices to the memory rows as in a direct
adjacency matrix mapping. Here, we consider a 4-bit representation for
simplicity. For example, v4 has out-going edges to v2 and v6 that are
stored vertically in a sub-array. PANDA can perform parallel in-memory
addition to calculate the out_degree of all nodes in parallel. For this
task, two rows in the sub-array are initialized to zero as Carry
reserved rows such that they can be selected along with two operands
(here the v4→v2 data (0001) and the v4→v6 data (0001)) to perform
parallel in-memory addition. To generate the initial Carry and Sum
bits, PANDA takes every three rows to perform a parallel in-memory
addition. The results are written back to the reserved memory space
(Resv.). The next step then only deals with the multi-bit addition of
the resultant data, proceeding bit by bit from the LSBs of the two
words towards the MSBs. PANDA is then able to compare the out_degree
and in_degree of each node in parallel to determine the start node.
After finding the start node as shown in Fig. 10, contig generation can
be readily accomplished by finding the Eulerian path and putting
together the vertex data from different sub-arrays.
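The column-parallel, bit-serial addition described above can be modelled with bit-wise operations on whole rows. In the Python sketch below, each memory row is an integer whose bit j belongs to column j, so one bit-wise operation acts on all columns at once. The function names are hypothetical, and the full-adder identities used (Sum as a three-input XOR, Carry as a three-input majority) are the standard ones; how PANDA's sense amplifiers realise them is not modelled here.

```python
def pack(values, nbits):
    """Store one word per column: row b holds bit b of every column's word (LSB row first)."""
    return [
        sum(((v >> b) & 1) << col for col, v in enumerate(values))
        for b in range(nbits)
    ]


def unpack(rows, ncols):
    """Read the vertically stored words back, one integer per column."""
    return [
        sum(((row >> col) & 1) << b for b, row in enumerate(rows))
        for col in range(ncols)
    ]


def column_parallel_add(a_rows, b_rows):
    """Add two vertically stored operands in all columns at once, LSB row first."""
    carry = 0  # the carry row is initialised to zero
    out = []
    for a, b in zip(a_rows, b_rows):
        out.append(a ^ b ^ carry)                    # Sum = XOR3(a, b, carry)
        carry = (a & b) | (a & carry) | (b & carry)  # Carry = MAJ(a, b, carry)
    out.append(carry)  # final carry-out row
    return out
```

As a usage example, take the first edge count stored for each source vertex v1 to v6 as [1, 1, 2, 1, 0, 1] and the second as [0, 0, 0, 1, 0, 0] (only v4 has a second destination). Adding the two 4-bit operands gives [1, 1, 2, 2, 0, 1], the out-degrees of the running example, with the number of row operations depending on the word width rather than on the number of vertices.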
[Figure 10 graphic omitted. Recoverable labels: out_degree[v1..v6] = 1, 1, 2, 2, 0, 1; in_degree[v4] = 2; v3 is identified as the start node because out_degree[v3] = in_degree[v3] + 1; contig generation from sub-array data.]
Figure 10. PANDA in-memory addition and comparison scheme for finding
the start vertex.
IV. PERFORMANCE ESTIMATION
A. Setup
Accelerator: To the best of our knowledge, this work is the first to
explore the performance of a PIM platform for the genome assembly
problem. Therefore, we have to create the evaluation test bed from
scratch to have an impartial comparison with both von-Neumann and
non-von-Neumann architectures. We configure PANDA's computational
memory sub-array with 1024 rows and 256 columns, a 4×4 memory matrix
(with 1/1 as row/column activation) per bank organized in an H-tree
routing manner, and 16×16 banks (with 1/1 as row/column activation)
in each memory chip. For comparison, we consider five
computing platforms: 1) a general purpose processor (GPP): a quad-core
Intel Core i7-7700 CPU @ 3.60GHz with 8192MB DIMM DDR4 1600MHz RAM and
8192KB cache; 2) a processing-in-STT-MRAM platform capable of
performing bulk bit-wise operations [39]; 3) a recently developed
processing-in-SOT-MRAM platform for DNA sequence alignment, optimized
to perform comparison-intensive operations [5]; 4) a
processing-in-ReRAM accelerator designed for accelerating bulk bit-wise
operations [40]; 5) a processing-in-DRAM accelerator based on Ambit
[7], working with a triple row activation mechanism to implement
various functions. The detailed evaluation framework developed for the
PIM platforms is shown in Fig. 11. All PIM platforms have a physical
memory configuration identical to PANDA's. Additionally, we developed a
similar cross-layer simulation framework, starting from device-level
simulation all the way to the circuit and architecture levels, as
explained for PANDA in Section II.D. The results of the architecture
evaluation of all PIM platforms were then fed to a high-level in-house
simulator developed in Matlab to perform each genome assembly stage
based on our customized and PIM-friendly algorithm and to estimate the
overall performance. It is noteworthy that the DPU was developed in
HDL, and the performance results were extracted with Synopsys Design
Compiler [41] and fed to the developed NVSim library for each PIM
platform.
To evaluate the CPU performance, we use Trinity-v2.8.5 [23], which was
shown to be sensitive and efficient in recovering full-length
transcripts. Trinity constructs a de Bruijn graph from short-read
sequences, employs an enumeration algorithm to score all branches, and
keeps the possible ones as isoforms/transcripts.
Experiment: In our experiment, we create 60952 short reads through the
Trinity sample genome bank with 519771 unique k-mers. We initially set
the k-mer length, k, to the default of 25, and then change it to 22,
27, and 32 as typical values for most genome assemblers. To clarify,
the CPU executes the Inchworm, Chrysalis, and Butterfly steps in
Trinity, while the PIM platforms under test run the three main
procedures in genome assembly shown in Fig. 6, i.e., Hashmap, DeBruijn,
and Traverse. We compare Trinity's power consumption and execution time
to those of the PIM assemblers by several measures. To have a fair
comparison with such a comprehensive assembler (which performs the full
genome assembly task including the scaffolding step), we penalized the
PIM platforms with ∼25% additional time and power. We believe this
provides a more realistic comparison with a von-Neumann
architecture-based assembler.
[Figure 11 graphic omitted. Layers shown: device (MTJ modeling using NEGF-LLG, Verilog-A), circuit (Spectre/SPICE, design and verification of a single 1024×256 sub-array), controller (Synopsys Design Compiler), architecture (modified NVSim and Cacti), and application (Matlab).]
Figure 11. Evaluation framework developed for processing-in-memory
platforms.
B. Run Time
The execution time of genome assembly task for different
platforms is reported in Fig. 12. For k=25, the CPU platform
executes the Inchworm, Chrysalis, and Butterfly steps [23] of
Trinity in ∼32s, where Chrysalis for clustering the contigs
and constructing complete de Bruijn graph takes the largest
fraction of the run time (28s) as expected. However, the
comparison operation-intensive Hashmap procedure for k-mer
analysis takes the largest fraction of execution time in all
PIM platforms (over 40% of total run time). Larger k-mer
length typically diminishes the de Bruijn graph connectivity
by simultaneously reducing the number of ambiguous repeats
in the graph and chance of overlap between two reads. That is
why run time for all platforms reduces with increase of k-mer
length.
Figure 12. The breakdown of run time (in seconds) for the platforms
under test running the genome assembly task with different k-mer
lengths (k = 22, 25, 27, and 32) into the Hashmap, DeBruijn, and
Traverse stages. In each bar group from left to right: CPU,
processing-in-STT-MRAM [39], PANDA, processing-in-SOT-MRAM [5],
processing-in-DRAM [7], and processing-in-RRAM [40].
Figure 13. The breakdown of power consumption (in W) for PIM platforms
running the genome assembly task with different k-mer lengths
(k = 22, 25, 27, and 32) compared to the CPU. In each bar group from
left to right: CPU, processing-in-STT-MRAM [39], PANDA,
processing-in-SOT-MRAM [5], processing-in-DRAM [7], and
processing-in-RRAM [40].
We can observe that PIM platforms reduce the run time
remarkably compared to the CPU. As shown, PANDA reduces
the run time by ∼18× compared to the CPU platform for
k=25 (18.8× on average over the four k-mer lengths). The
PANDA platform essentially accelerates the graph construction
and traversal stages by ∼21.5× compared with the CPU platform.
Increasing the k-mer length to 32 yields an even higher
speed-up. Compared with counterpart PIM platforms,
our X(N)OR-friendly design reduces the run time on average
by 4.2× and 2.5× compared to the STT-PIM [39] and SOT-PIM [5]
platforms, the fastest counterparts, respectively. This comes
from the fact that these PIM platforms require multi-cycle
operations to implement addition. Besides,
the SOT-based device intrinsically shows a higher write speed
than STT devices. Compared to the DRAM and RRAM
platforms, PANDA achieves on average 10.9× and 6× speed-up
across the k-mer lengths. It is worth pointing out
that processing-in-DRAM platforms compute destructively
and require multiple memory cycles to
copy the operands to particular rows before computation.
Ambit [7], for example, needs 7 memory cycles to implement
the in-memory X(N)OR function.
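As a rough software illustration of why an X(N)OR-friendly substrate suits this workload, the sketch below packs each base into 2 bits and compares two k-mers with a single bulk XNOR over the packed words. The 2-bit encoding (A=00, C=01, G=10, T=11) and the helper names `encode_kmer`, `xnor`, and `kmers_equal` are assumptions made for this example; they are not taken from the PANDA design, which performs the comparison inside the memory sub-array rather than in software.

```python
# Illustrative software model only; not the PANDA circuit or its data layout.
BASE_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed encoding


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer, 2 bits per base (first base is most significant)."""
    word = 0
    for base in kmer:
        word = (word << 2) | BASE_BITS[base]
    return word


def xnor(a: int, b: int, width: int) -> int:
    """Bit-wise XNOR of two words restricted to `width` bits."""
    mask = (1 << width) - 1
    return ~(a ^ b) & mask


def kmers_equal(x: str, y: str) -> bool:
    """Two k-mers match when every bit of their XNOR is 1."""
    if len(x) != len(y):
        return False
    width = 2 * len(x)
    return xnor(encode_kmer(x), encode_kmer(y), width) == (1 << width) - 1
```

In this model a whole k-mer is compared with one XNOR followed by an all-ones check, which is the kind of bulk bit-wise operation the run-time comparison above is about.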
C. Power Consumption
Figure 14. Trade-off between power consumption (W) and run time (s)
w.r.t. the parallelism degree (1, 2, 4, and 8) for PANDA, SOT-PIM,
and STT-PIM at k=25. The CPU power on Trinity is also marked.
We estimated the power consumption of the PIM
platforms for different k-mer lengths and compared it to the
CPU platform, as shown in Fig. 13. Based on our results,
all PIM platforms under test consume significantly less
power than the CPU. The
breakdown of power consumption is also shown for the PIM
platforms; however, it could not be obtained accurately for
the CPU, for which only the overall power consumption is
reported. In our
experiment, the processing-in-SOT-MRAM design [5] achieves
the smallest power consumption (on average) to run the three
main procedures, compared with the CPU and other PIM
platforms. The PANDA platform stands as the second most
power-efficient design. This is mainly due to the three-SA-based
bit-line computing scheme in PANDA, compared with the
two-SA-per-bit-line technique in the counterpart design. While
the proposed scheme brings more speed-up than
the design in [5], it requires relatively more power.
PANDA reduces the power consumption by ∼9.2× on average
compared with the CPU platform over the different k-mer
lengths. Besides, it reduces the power consumption by ∼18%
compared with the STT-MRAM [39] platform. The main reason
behind this improvement is the more efficient addition operation in
PANDA. Addition requires additional memory cycles
in the STT-MRAM [39] platform to save the carry bit back to the
memory and use it again for the computation of the next bits.
Compared to the DRAM and RRAM platforms, PANDA obtains
on average 2.11× and 55% power reduction across the
k-mer lengths.
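The role of the carry bit in the discussion above can be illustrated in software. In a bit-serial adder the sum bit is the 3-input XOR of the two operand bits and the carry-in, and the carry-out is their majority; a platform that cannot hold the carry locally has to write it back to memory at every bit position. The sketch below is a plain Python model that counts such write-backs. The function name `bit_serial_add` and the `carry_writebacks` counter are hypothetical, and the actual cycle counts of the STT-MRAM [39] or PANDA hardware are not modelled.

```python
def bit_serial_add(a: int, b: int, width: int):
    """Add two unsigned `width`-bit words one bit position at a time.

    Returns (sum modulo 2**width, final carry-out, carry_writebacks), where
    carry_writebacks counts how often the carry would have to be stored if it
    could not be kept next to the adder (an assumption of this model).
    """
    carry = 0
    result = 0
    carry_writebacks = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry                            # 3-input XOR: sum bit
        carry = (x & y) | (x & carry) | (y & carry)  # majority: carry-out
        result |= s << i
        carry_writebacks += 1                        # one stored carry per bit
    return result, carry, carry_writebacks
```

For two 4-bit operands this model performs four carry write-backs, one per bit, which is the overhead that a design keeping the carry local avoids.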
D. Speed-up/Power-Efficiency Trade-off
We investigate the power-efficiency and speed-up of the three
best PIM platforms under test, based on the run time and
power consumption results in the previous subsections, by
tuning the number of active sub-arrays (Ns) associated with
the comparison and addition operations. A parallelism degree
(Pd) can then be defined as the number of replicated sub-arrays
used to boost the performance of the PIM platforms through
parallel processing, as shown in prior works [5], [12]. For example,
when Pd is set to 2, two parallel sub-arrays process the
in-memory operations simultaneously. We
expect such parallelism to improve the performance of genome
assembly at the cost of power consumption and
area. Fig. 14 plots the trade-off between run time
and power consumption vs. Pd for k=25. The estimated CPU
power budget required to execute Trinity is also shown. It
can be seen that for all platforms the run time decreases as
the parallelism increases. For example, for the PANDA platform in
an extreme case, increasing Pd from 1 to 8 increases the power
consumption from ∼19W to 128W (∼7×) and reduces the
execution time by a factor of 3, which might not be a favorable
trade. Therefore, a user can tailor the PANDA
performance to meet the system/application constraints. Here,
we show the optimum theoretical performance of PANDA and
the other PIM platforms by pinpointing the intersection between
the power and run time curves in Fig. 14. We observe that PANDA
achieves the smallest run time and power consumption
with a Pd of ∼2 compared with the others.
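One way to read the extreme case above is through energy, i.e., power multiplied by run time. Using only the figures quoted in the text (∼19W at Pd=1, 128W at Pd=8, and a 3× shorter execution time), the following sketch computes the relative change. The function name `energy_ratio` is an assumption for illustration, and the run time is treated as a ratio because absolute values are not stated in the text.

```python
def energy_ratio(p_base_w: float, p_new_w: float, speedup: float) -> float:
    """Energy of the new configuration relative to the base one.

    Energy = power * time, so the ratio is (p_new / p_base) / speedup,
    where speedup is the factor by which the run time shrinks.
    """
    if p_base_w <= 0 or speedup <= 0:
        raise ValueError("base power and speedup must be positive")
    return (p_new_w / p_base_w) / speedup


# Values quoted in the text for PANDA when Pd goes from 1 to 8.
power_ratio = 128.0 / 19.0
ratio = energy_ratio(19.0, 128.0, 3.0)
print(f"power x{power_ratio:.2f}, energy x{ratio:.2f}")
```

With these numbers the power rises by about 6.7× while the energy per run rises by roughly 2.2×, which is consistent with the remark that the extreme case might not be favorable.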
E. Memory Wall Challenge
The power-efficiency and speed-up of PIM platforms against
the von-Neumann architecture-based CPU were discussed in
prior subsections. Here, we further explore the reasons behind
the reported numbers by considering two new measures, i.e.,
Memory Bottleneck Ratio (MBR) and Resource Utilization
Ratio (RUR). We define MBR as the fraction of time needed for
data transfer from/to on-chip or off-chip memory when computation
has to wait for data, i.e., when the memory wall happens. We also
define RUR as the fraction of time in which the computation
resources are loaded with data. The memory wall is considered
the main bottleneck that brings large power consumption
and lengthens the execution time on the CPU. The MBR is reported
in Fig. 15a. The peak throughput of each design at four
distinct k-mer lengths is taken into account for the
evaluation. This evaluation mainly considers the number of
memory accesses. As shown, PANDA spends less than ∼17%
of the time on data transfer owing to the PIM acceleration schemes,
while the CPU’s MBR increases to 65% when k=25. Besides,
we observe that all the other PIM platforms except DRAM
also spend less than ∼17% of the time on data communication.
A smaller MBR translates into a higher RUR for
the accelerators, as plotted in Fig. 15b. We can see that
PANDA achieves the highest RUR, up to ∼82%. Taking everything
into account, PIM acceleration schemes offer a high utilization
ratio (>60% excluding DRAM), confirming the conclusion
drawn from Fig. 15a. The memory wall evaluation shows the
efficiency of the PANDA platform in addressing the memory wall
challenge.
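The two measures defined above can be made concrete with a small sketch that derives them from an execution timeline. The `Interval` record, the three interval kinds, and the example trace are all hypothetical and chosen only to show the arithmetic; they are not the measurement procedure or data used in the evaluation.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Interval:
    duration: float  # arbitrary time units
    kind: str        # "compute", "stall" (computation waiting for data), or "idle"


def mbr_rur(trace: Iterable[Interval]) -> Tuple[float, float]:
    """Return (MBR, RUR) as fractions of the total traced time."""
    trace = list(trace)
    total = sum(iv.duration for iv in trace)
    if total <= 0:
        raise ValueError("trace must cover a positive amount of time")
    stall = sum(iv.duration for iv in trace if iv.kind == "stall")
    busy = sum(iv.duration for iv in trace if iv.kind == "compute")
    return stall / total, busy / total


# Made-up trace: 6 units computing, 3 units stalled on data, 1 unit idle.
example = [Interval(6.0, "compute"), Interval(3.0, "stall"), Interval(1.0, "idle")]
mbr, rur = mbr_rur(example)
```

In this model the two ratios sum to at most 1, so shrinking the stall time leaves more of the timeline available for useful computation, in line with the observation that a smaller MBR goes together with a higher RUR.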
Figure 15. (a) The memory bottleneck ratio (MBR, %) and (b) the
resource utilization ratio (RUR, %) for the CPU and the PIM platforms
under test running the genome assembly task at k = 22, 25, 27, and 32.
V. CONCLUSION
In this paper, we presented PANDA as a new processing-in-SOT-MRAM
platform to accelerate the comparison/addition-intensive
genome assembly application using PIM-friendly
operations. We developed PANDA based on a set of new
circuit-level schemes to realize a data-parallel computational
core for genome assembly. The platform is configured with a
novel data partitioning and mapping technique that provides
local storage and processing to fully utilize the parallelism
of our customized algorithm. The cross-layer simulation results
demonstrate that PANDA reduces the execution time and
power by ∼18× and ∼11×, respectively, compared with the
CPU. Besides, speed-ups of up to 2-4× can be obtained over
recent processing-in-MRAM platforms performing a similar
task.
REFERENCES
[1] H. Li and N. Homer, “A survey of sequence alignment algorithms for
next-generation sequencing,” Briefings in bioinformatics, vol. 11, no. 5,
pp. 473–483, 2010.
[2] E. Georganas, A. Buluç, J. Chapman, L. Oliker, D. Rokhsar, and
K. Yelick, “Parallel de bruijn graph construction and traversal for de
novo genome assembly,” in SC’14: Proceedings of the International
Conference for High Performance Computing, Networking, Storage and
Analysis. IEEE, 2014, pp. 437–448.
[3] J. A. Chapman, I. Ho, S. Sunkara, S. Luo, G. P. Schroth, and D. S.
Rokhsar, “Meraculous: de novo genome assembly with short paired-end
reads,” PloS one, vol. 6, no. 8, p. e23501, 2011.
[4] F. Zokaee, H. R. Zarandi, and L. Jiang, “Aligner: A process-in-memory
architecture for short read alignment in rerams,” IEEE Computer Archi-
tecture Letters, vol. 17, no. 2, pp. 237–240, 2018.
[5] S. Angizi, J. Sun, W. Zhang, and D. Fan, “Aligns: A processing-in-
memory accelerator for dna short read alignment leveraging sot-mram,”
in 2019 56th ACM/IEEE Design Automation Conference (DAC). IEEE,
2019, pp. 1–6.
[6] H. Pourmeidani, S. Sheikhfaal, R. Zand, and R. F. DeMara, “Proba-
bilistic interpolation recoder for energy-error-product efficient dbns with
p-bit devices,” IEEE Transactions on Emerging Topics in Computing,
2020.
[7] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim,
M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-
memory accelerator for bulk bitwise operations using commodity dram
technology,” in 2017 50th Annual IEEE/ACM International Symposium
on Microarchitecture (MICRO). IEEE, 2017, pp. 273–287.
[8] S. Angizi and D. Fan, “Graphide: A graph processing accelerator
leveraging in-dram-computing,” in Proceedings of the 2019 on Great
Lakes Symposium on VLSI, 2019, pp. 45–50.
[9] J. Yu, R. Nane, I. Ashraf, M. Taouil, S. Hamdioui, H. Corporaal, and
K. Bertels, “Skeleton-based synthesis flow for computation-in-memory
architectures,” IEEE Transactions on Emerging Topics in Computing,
2017.
[10] S. Angizi, J. Sun, W. Zhang, and D. Fan, “Pim-aligner: a processing-
in-mram platform for biological sequence alignment,” in 2020 Design,
Automation & Test in Europe Conference & Exhibition (DATE). IEEE,
2020, pp. 1265–1270.
[11] S. Angizi, Z. He, A. S. Rakin, and D. Fan, “Cmp-pim: an energy-efficient
comparator-based processing-in-memory neural network accelerator,” in
Proceedings of the 55th Annual Design Automation Conference, 2018,
pp. 1–6.
[12] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo: A
processing-in-memory architecture for bulk bitwise operations in emerg-
ing non-volatile memories,” in Proceedings of the 53rd Annual Design
Automation Conference, 2016, pp. 1–6.
[13] Z. I. Chowdhury, M. Zabihi, S. K. Khatamifard, Z. Zhao, S. Resch,
M. Razaviyayn, J.-P. Wang, S. S. Sapatnekar, and U. R. Karpuzcu,
“A dna read alignment accelerator based on computational ram,” IEEE
Journal on Exploratory Solid-State Computational Devices and Circuits,
vol. 6, no. 1, pp. 80–88, 2020.
[14] W. Kang, H. Wang, Z. Wang, Y. Zhang, and W. Zhao, “In-memory
processing paradigm for bitwise logic operations in stt–mram,” IEEE
Transactions on Magnetics, vol. 53, no. 11, pp. 1–4, 2017.
[15] S. Angizi, J. Sun, W. Zhang, and D. Fan, “Graphs: A graph processing
accelerator leveraging sot-mram,” in 2019 Design, Automation & Test in
Europe Conference & Exhibition (DATE). IEEE, 2019, pp. 378–383.
[16] A. Roohi, S. Sheikhfaal, S. Angizi, D. Fan, and R. F. DeMara, “Apgan:
Approximate gan for robust low energy learning from imprecise compo-
nents,” IEEE Transactions on Computers, vol. 69, no. 3, pp. 349–360,
2019.
[17] R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, and
J. Wang, “Soap2: an improved ultrafast tool for short read alignment,”
Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[18] C.-M. Liu, T. Wong, E. Wu, R. Luo, S.-M. Yiu, Y. Li, B. Wang, C. Yu,
X. Chu, K. Zhao et al., “Soap3: ultra-fast gpu-based parallel alignment
tool for short reads,” Bioinformatics, vol. 28, no. 6, pp. 878–879, 2012.
[19] J. Arram, T. Kaplan, W. Luk, and P. Jiang, “Leveraging fpgas for accel-
erating short read alignment,” IEEE/ACM transactions on computational
biology and bioinformatics, vol. 14, no. 3, pp. 668–677, 2016.
[20] S. F. Mahmood and H. Rangwala, “Gpu-euler: Sequence assembly using
gpgpu,” in 2011 IEEE International Conference on High Performance
Computing and Communications. IEEE, 2011, pp. 153–160.
[21] B. S. C. Varma, K. Paul, and M. Balakrishnan, “Fpga-based acceleration
of de novo genome assembly,” in Architecture Exploration of FPGA
Based Accelerators for BioInformatics Applications. Springer, 2016,
pp. 55–79.
[22] D. R. Zerbino and E. Birney, “Velvet: algorithms for de novo short
read assembly using de bruijn graphs,” Genome research, vol. 18, pp.
821–829, 2008.
[23] M. G. Grabherr, B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson,
I. Amit, X. Adiconis, L. Fan, R. Raychowdhury, Q. Zeng et al., “Full-
length transcriptome assembly from rna-seq data without a reference
genome,” Nature biotechnology, vol. 29, no. 7, pp. 644–652, 2011.
[24] S. Goswami, K. Lee, S. Shams, and S.-J. Park, “Gpu-accelerated large-
scale genome assembly,” in 2018 IEEE International Parallel and
Distributed Processing Symposium (IPDPS). IEEE, 2018, pp. 814–
824.
[25] S. Ren, N. Ahmed, K. Bertels, and Z. Al-Ars, “An efficient gpu-based de
bruijn graph construction algorithm for micro-assembly,” in 2018 IEEE
18th International Conference on Bioinformatics and Bioengineering
(BIBE). IEEE, 2018, pp. 67–72.
[26] M. Lu, Q. Luo, B. Wang, J. Wu, and J. Zhao, “Gpu-accelerated
bidirected de bruijn graph construction for genome assembly,” in Asia-
Pacific Web Conference. Springer, 2013, pp. 51–62.
[27] S. Angizi, N. Ahmed Fahmi, W. Zhang, and D. Fan, “Pim-assembler:
A processing-in-memory platform for genome assembly,” in 57th
Design Automation Conference (DAC), San Francisco, CA, July 19-23,
2020. IEEE, In press.
[28] S. Angizi, Z. He, and D. Fan, “Pima-logic: A novel processing-
in-memory architecture for highly flexible and energy-efficient logic
computation,” in Proceedings of the 55th Annual Design Automation
Conference, 2018, pp. 1–6.
[29] X. Fong, Y. Kim, K. Yogendra, D. Fan, A. Sengupta, A. Raghunathan,
and K. Roy, “Spin-transfer torque devices for logic and memory:
Prospects and perspectives,” IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, vol. 35, no. 1, pp. 1–22,
2015.
[30] S. Angizi, Z. He, F. Parveen, and D. Fan, “Imce: Energy-efficient bit-
wise in-memory convolution engine for deep neural network,” in 2018
23rd Asia and South Pacific Design Automation Conference (ASP-DAC).
IEEE, 2018, pp. 111–116.
[31] C.-F. Pai, L. Liu, Y. Li, H. Tseng, D. Ralph, and R. Buhrman, “Spin
transfer torque devices utilizing the giant spin hall effect of tungsten,”
Applied Physics Letters, vol. 101, no. 12, p. 122404, 2012.
[32] B. Razavi, “The strongarm latch [a circuit for all seasons],” IEEE Solid-
State Circuits Magazine, vol. 7, no. 2, pp. 12–17, 2015.
[33] (2011) Ncsu eda freepdk45. [Online]. Available: http://www.eda.ncsu.
edu/wiki/FreePDK45:Contents
[34] S. Angizi, Z. He, A. Awad, and D. Fan, “Mrima: An mram-based in-
memory accelerator,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 39, no. 5, pp. 1123–1136, 2019.
[35] (2018) Parallel thread execution isa version 6.1. [Online]. Available:
http://docs.nvidia.com/cuda/parallel-thread-execution/index.html
[36] R. Li, H. Zhu, J. Ruan, W. Qian, X. Fang, Z. Shi, Y. Li, S. Li, G. Shan,
K. Kristiansen et al., “De novo assembly of human genomes with
massively parallel short read sequencing,” Genome research, vol. 20,
no. 2, pp. 265–272, 2010.
[37] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, and
I. Birol, “Abyss: a parallel assembler for short read sequence data,”
Genome research, vol. 19, no. 6, pp. 1117–1123, 2009.
[38] G. Dai, T. Huang, Y. Chi, J. Zhao, G. Sun, Y. Liu, Y. Wang, Y. Xie,
and H. Yang, “Graphh: A processing-in-memory architecture for large-
scale graph processing,” IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 38, no. 4, pp. 640–653, 2018.
[39] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing in memory
with spin-transfer torque magnetic ram,” IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 26, no. 3, pp. 470–483,
2017.
[40] M. Imani, Y. Kim, and T. Rosing, “Mpim: Multi-purpose in-memory
processing using configurable resistive memory,” in 2017 22nd Asia and
South Pacific Design Automation Conference (ASP-DAC). IEEE, 2017,
pp. 757–763.
[41] Synopsys, Inc., “Synopsys design compiler, product version 14.9.2014,”
2014.
Shaahin Angizi is currently a Ph.D. candidate in
Electrical Engineering at School of Electrical, Com-
puter and Energy Engineering, Arizona State Univer-
sity, Tempe, AZ, USA. His primary research inter-
ests include ultra-low power in-memory computing
based on volatile & non-volatile memories, brain
inspired (neuromorphic) computing, and accelerator
design for deep neural network and bioinformatics.
He is the recipient of Best Poster Award of Ph.D.
Forum at IEEE/ACM DAC in 2018, two Best Paper
Awards of IEEE ISVLSI in 2017 and 2018, and Best
Paper Award of ACM GLSVLSI in 2019. He is a student member of IEEE.
Naima Ahmed Fahmi is a Ph.D. student in Com-
puter Science at University of Central Florida, Or-
lando, Florida. She is working under the supervision
of Dr. Wei Zhang, Assistant Professor in Computer
Science, University of Central Florida. Naima’s re-
search focus is to apply Machine Learning methods
and algorithms to extract useful information from
large scale genomic data. Currently, she is working
with cancer cell lines and real patient samples
to identify predictive and prognostic molecular markers for the
disease.
Wei Zhang received the MS and PhD degrees
in computer science from University of Minnesota
Twin Cities in 2011 and 2015, respectively. He
joined the Department of Computer Science, Uni-
versity of Central Florida, Orlando, Florida, as an
assistant professor in 2017. His primary research
interest lies at the interaction of computational biol-
ogy and machine learning, and has been focusing on
graph-based learning models for biomarker selection
and cancer outcome prediction. His other research
interests include cancer transcript variants and drug
sensitivity prediction. He received NSF CRII Award in 2018.
Deliang Fan is currently an Assistant Professor in
the School of Electrical, Computer and Energy Engi-
neering, Arizona State University, Tempe, AZ, USA.
Before joining ASU in 2019, he was an assistant
professor in Department of ECE at University of
Central Florida, Orlando, FL, USA. He received his
M.S. and Ph.D. degrees, under the supervision of
Prof. Kaushik Roy, in ECE from Purdue Univer-
sity, West Lafayette, IN, USA, in 2012 and 2015,
respectively. His primary research interests include
Energy Efficient and High Performance Big Data
Processing-In-Memory Circuit, Architecture and Algorithm, with applications
in Deep Neural Network, Data Encryption, Graph Processing and Bioinfor-
matics Acceleration-in-Memory system; Brain-inspired (Neuromorphic) and
Boolean Computing Using Emerging Nanoscale Devices like Spintronics and
Memristors; security of AI system. He has authored and co-authored 100+
peer-reviewed international journal/conference papers. He served as the TPC
member of DAC, ICCAD, DATE, GLSVLSI, ISVLSI, ASP-DAC, ISQED,
etc. He also served as the area technical chair of GLSVLSI, ISQED, etc.
