GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using
  Processing-in-Memory Technologies by Kim, Jeremie S. et al.
Kim et al.
METHODOLOGY
GRIM-Filter: Fast Seed Location Filtering
in DNA Read Mapping
Using Processing-in-Memory Technologies
Jeremie S. Kim1,6*, Damla Senol Cali1, Hongyi Xin2, Donghyuk Lee3, Saugata Ghose1,
Mohammed Alser4, Hasan Hassan6, Oguz Ergin5, Can Alkan*4, and Onur Mutlu*6,1
*Correspondence:
jeremiekim123@gmail.com,
calkan@cs.bilkent.edu.tr,
onur.mutlu@inf.ethz.ch
1Department of Electrical and
Computer Engineering, Carnegie
Mellon University, Pittsburgh, PA,
USA
6Department of Computer
Science, ETH Zürich, Zürich, CH
Full list of author information is
available at the end of the article
Abstract
Motivation: Seed location filtering is critical in DNA read mapping, a process
where billions of DNA fragments (reads) sampled from a donor are mapped onto
a reference genome to identify genomic variants of the donor. State-of-the-art
read mappers 1) quickly generate possible mapping locations for seeds (i.e.,
smaller segments) within each read, 2) extract reference sequences at each of the
mapping locations, and 3) check similarity between each read and its associated
reference sequences with a computationally-expensive algorithm (i.e., sequence
alignment) to determine the origin of the read. A seed location filter comes into
play before alignment, discarding seed locations that alignment would deem a
poor match. The ideal seed location filter would discard all poor match locations
prior to alignment such that there is no wasted computation on unnecessary
alignments.
Results: We propose a novel seed location filtering algorithm, GRIM-Filter,
optimized to exploit 3D-stacked memory systems that integrate computation
within a logic layer stacked under memory layers, to perform
processing-in-memory (PIM). GRIM-Filter quickly filters seed locations by
1) introducing a new representation of coarse-grained segments of the reference
genome, and 2) using massively-parallel in-memory operations to identify read
presence within each coarse-grained segment. Our evaluations show that for a
sequence alignment error tolerance of 0.05, GRIM-Filter 1) reduces the false
negative rate of filtering by 5.59x–6.41x, and 2) provides an end-to-end read
mapper speedup of 1.81x–3.65x, compared to a state-of-the-art read mapper
employing the best previous seed location filtering algorithm.
Availability: The code is available online at:
https://github.com/CMU-SAFARI/GRIM
1 Introduction
Our understanding of human genomes today is affected by the ability of modern
technology to quickly and accurately determine an individual’s entire genome. The
human genome is composed of a sequence of approximately 3 billion bases that are
grouped into deoxyribonucleic acids (DNA), but today’s machines can identify DNA
only in short sequences (i.e., reads). Determining a genome requires three stages:
1) cutting the genome into many short reads, 2) identifying the DNA sequence
of each read, and 3) mapping each read against the reference genome in order
to analyze the variations in the sequenced genome. In this paper, we focus on
improving the third stage, often referred to as read mapping, which is a major
computational bottleneck of a modern genome analysis pipeline. Read mapping is
performed computationally by read mappers after each read has been identified.
Seed-and-extend mappers [5, 9, 34, 40, 85, 97] are a class of read mappers that
break down each read sequence into seeds (i.e., smaller segments) to find locations
in the reference genome that closely match the read. Figure 1 illustrates the five
steps used by a seed-and-extend mapper. First, the mapper obtains a read (❶ in
the figure). Second, the mapper selects smaller DNA segments from the read to
serve as seeds (❷). Third, the mapper indexes a data structure with each seed to
obtain a list of possible locations within the reference genome that could result in
a match (❸). Fourth, for each possible location in the list, the mapper obtains the
corresponding DNA sequence from the reference genome (❹). Fifth, the mapper
aligns the read sequence to the reference sequence (❺), using an expensive sequence
alignment (i.e., verification) algorithm to determine the similarity between the read
sequence and the reference sequence.
arXiv:1711.01177v1 [q-bio.GN] 2 Nov 2017
[Figure 1 image: a read is split into seeds (GAACTTGGAGTC, TACGAGGGTTTC,
CTAACGTGCCTT, GCATGTAGCTAC); a data structure returns location lists (L1 to L14)
for the selected k-mers; the corresponding reference genome segments are fetched
and passed to alignment / verification. The steps are labeled 1 to 5.]
Figure 1: Flowchart of a seed-and-extend mapper.
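The five steps above can be sketched in a few lines of Python. This is only an illustrative toy, not the implementation of mrFAST or any other mapper: the function names (`build_index`, `edit_distance`, `map_read`), the use of non-overlapping fixed-length seeds, and the plain Levenshtein verification are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def edit_distance(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance (the expensive step)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match / substitution
        prev = curr
    return prev[-1]


def map_read(read: str, reference: str, index: dict, k: int, max_errors: int) -> list:
    """Return reference positions where the read aligns within max_errors edits."""
    candidates = set()
    # Steps 2-3: cut the read into non-overlapping seeds and look each one up.
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset  # where the whole read would begin
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # Steps 4-5: fetch each candidate reference segment and verify by alignment.
    return sorted(s for s in candidates
                  if edit_distance(read, reference[s:s + len(read)]) <= max_errors)
```

Every candidate produced by the index lookup is aligned in this sketch, including those that turn out to be poor matches; that wasted alignment work is what a seed location filter is meant to avoid.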
To improve the performance of seed-and-extend mappers, we can utilize seed lo-
cation filters, recently introduced by Xin et al. [98]. A seed location filter efficiently
determines whether a candidate mapping location would result in an incorrect map-
ping before performing the computationally-expensive sequence alignment step for
that location. As long as the filter can eliminate possible locations that would result
in an incorrect mapping faster than the time it takes to perform the alignment, the
entire read mapping process can be substantially accelerated [10, 11, 98, 100]. As
a result, several recent works have focused on optimizing the performance of seed
location filters [10, 11, 95, 98–100].
With the advent of seed location filters, the performance bottleneck of DNA read
mapping has shifted from sequence alignment to seed location filtering [10, 11,
98, 100]. Unfortunately, a seed location filter requires large amounts of memory
bandwidth to process and characterize each of the candidate locations. Our goal is
to reduce the time spent in filtering and thereby improve the speed of DNA read
mapping. To this end, we present a new algorithm, GRIM-Filter, to efficiently filter
locations with high parallelism. We design GRIM-Filter such that it is well-suited
for implementation on 3D-stacked memory, exploiting the parallel and low-latency
processing capability in the logic layer of the memory.
3D-stacked DRAM [2, 4, 6, 12, 42–44, 59, 67, 80] is a new technology that inte-
grates logic and memory in a three-dimensional stack of dies with a large internal
data transfer bandwidth. This enables the bulk transfer of data from each memory
layer to a logic layer that can perform simple parallel operations on the data.
Conventional computing requires the movement of data on the long, slow, and
energy-hungry buses between the CPU processing cores and memory such that cores
can operate on data. In contrast, processing-in-memory (PIM)-enabled devices such
as 3D-stacked memory can perform simple arithmetic operations very close to where
the data resides, with high bandwidth and low latency. With carefully designed
algorithms for PIM, application performance can often be greatly improved (e.g.,
as shown in [6, 42, 43, 92]) because the relatively narrow and long-latency bus
between the CPU cores and memory no longer impedes the speed of computation
on the data.
Our goal is to develop a seed location filter that exploits the high memory
bandwidth and processing-in-memory capabilities of 3D-stacked DRAM to improve
the performance of DNA read mappers.
To our knowledge, this is the first seed location filtering algorithm that accelerates
read mapping by overcoming the memory bottleneck with PIM using 3D-stacked
memory technologies. GRIM-Filter can be used with any read mapper. However, in
this work we demonstrate the effectiveness of GRIM-Filter with a hash table based
mapper, mrFAST with FastHASH [98]. We improve the performance of hash table
based read mappers while maintaining their high sensitivity and comprehensiveness
(which were originally demonstrated in [9]).
Key Mechanism. GRIM-Filter provides a quick method for determining whether
a read will not match at a given location, thus allowing the read mapper to skip
the expensive sequence alignment process for that location. GRIM-Filter works
by counting the existence of small segments of a read in a genome region. If the
count falls under a threshold, indicating that many small segments of a read are
not present, GRIM-Filter discards the locations in that region before alignment.
The existence of all small segments in a region is stored in a bitvector, which can
be easily predetermined for each region of a reference genome. The bitvector for a
reference genome region is retrieved when a read must be checked for a match in
the region. We find that this regional approximation technique not only enables a
high performance boost via high parallelism, but also improves filtering accuracy
over the state-of-the-art. The filtering accuracy improvement comes from the finer
granularity GRIM-Filter uses in counting the subsequences of a read in a region of
a genome, compared to the state-of-the-art filter [98].
Key Results. We evaluate GRIM-Filter qualitatively and quantitatively against
the state-of-the-art seed location filter, FastHASH [98]. Our results show that
GRIM-Filter provides a 5.59x–6.41x smaller false negative rate (i.e., the propor-
tion of locations that pass the filter, but that truly result in a poor match during
sequence alignment) than the best previous filter with zero false positives (i.e., the
number of locations that do not pass the filter, but that truly result in a good match
during sequence alignment). GRIM-Filter provides an end-to-end performance im-
provement of 1.81x–3.65x over a state-of-the-art DNA read mapper, mrFAST with
FastHASH, for a set of real genomic reads, when we use a sequence alignment error
tolerance of 0.05. We also note that as we increase the sequence alignment error
tolerance, the performance improvement of our filter over the state-of-the-art in-
creases. This makes GRIM-Filter more effective and relevant for future-generation
error-prone sequencing technologies, such as nanopore sequencing [28, 87].
2 Motivation and Aim
Mapping reads against a reference genome enables the analysis of the variations in
the sequenced genome. As the throughput of read mapping increases, more large-
scale genome analyses become possible. The ability to deeply characterize and ana-
lyze genomes at a large scale could change medicine from a reactive to a preventative
and further personalized practice. In order to motivate our method for improving the
performance of read mappers, we pinpoint the performance bottlenecks of modern-
day mappers on which we focus our acceleration efforts. We find that across our data
set (see Section 5), a state-of-the-art read mapper, mrFAST with FastHASH [98],
on average, spends 15% of its execution time performing sequence alignment on
locations that are found to be a match, and 59% of its execution time performing
sequence alignment on locations that are discarded because they are not found to
be a match (i.e., false locations).
Our goal is to implement a seed location filter that reduces the wasted computa-
tion time spent performing sequence alignment on such false locations. To this end,
a seed location filter would quickly determine if a location will not match the read
and, if so, it would avoid the sequence alignment altogether. The ideal seed location
filter correctly finds all false locations without increasing the time required to ex-
ecute read mapping. We find that such an ideal seed location filter would improve
the average performance of mrFAST (with FastHASH) by 3.2x. This speedup is
primarily due to the reduced number of false location alignments. In contrast, most
prior works [13–16, 18, 26, 35, 41, 62, 68–70, 81, 82, 96] gain their speedups by im-
plementing all or part of the read mapper in specialized hardware or GPUs, focusing
mainly on the acceleration of the sequence alignment process, not the avoidance of
sequence alignment. These works that accelerate sequence alignment provide or-
thogonal solutions, and could be implemented together with seed location filters,
including GRIM-Filter, for additional performance improvement (see Section 7 for
more detail).
3 GRIM-Filter
We now describe our proposal for a new seed location filter, GRIM-Filter. At a
high level, the key idea of GRIM-Filter is to store and utilize metadata on short
segments of the genome, i.e., segments on the order of several hundred base pairs
long, in order to quickly determine if a read will not result in a match at that
genome segment.
3.1 Genome Metadata Representation
Figure 2 shows a reference genome with its associated metadata that is formatted
for efficient operation by GRIM-Filter. The reference genome is divided into short
contiguous segments, on the order of several hundreds of base pairs, which we re-
fer to as bins. GRIM-Filter operates at the granularity of these bins, performing
analyses on the metadata associated with each bin. This metadata is represented
as a bitvector that stores whether or not a token, i.e., a short DNA sequence on
the order of 5 base pairs, is present within the associated bin. We refer to each bit
in the bitvector as an existence bit. To account for all possible tokens of length n,
each bitvector must be 4^n bits in length, where each bit denotes the existence of
a particular token instance. Figure 2 highlights the bits of two token instances of
bin2’s bitvector: it shows that 1) the token GACAG (green) exists in bin2, i.e., the
existence bit associated with the token GACAG is set to 1 in the b2 bitvector; and
2) the token TTTTT (red) is not present in bin2, i.e., the existence bit associated
with the token TTTTT is set to 0 in the b2 bitvector.
Because these bitvectors are associated with the reference genome, the bitvectors
need to be generated only once per reference, and they can be used to map any
number of reads from other individuals of the same species. In order to generate
the bitvectors, the genome must be sequentially scanned for every possible token of
length n, where n is the selected token size. If bin_x contains the token, the bit in the
b_x bitvector corresponding to the token must be set (1). If bin_x does not contain
the token, then the same bit is left unset (i.e., 0). These bitvectors are saved and
stored for later use when mapping reads to the same reference genome, i.e., they
are part of the genome’s metadata.
[Figure 2 image: (a) an example reference genome
(AAAAACCCCTGCCTTGCATGTAGAAAACTTGACAGGAACTTTTTATCGCA) divided into overlapping
bins bin1 to bin4; (b) the bitvectors b1, b2, ..., bt (t = number of bins), each of
length 4^5, with one row per token from AAAAA to TTTTT. In b2, the bit for GACAG
is 1 (GACAG exists in the 2nd bin) and the bit for TTTTT is 0 (TTTTT does not exist
in the 2nd bin).]
Figure 2: GRIM-Filter has a 2D data structure where each bit at <row, column> indicates if
a token (indexed by the row) is present in the corresponding bin (indicated by the column).
(a) GRIM-Filter divides a genome into overlapping bins. (b) GRIM-Filter’s metadata associ-
ated with a reference genome. Columns are indexed by the bin number of each location. Rows
are indexed by the token value. In this figure, token size=5.
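As a rough illustration of how such metadata could be generated, the sketch below builds one existence bitvector per bin. It is a minimal example under stated assumptions, not the authors' code: the names `token_index`, `build_bitvectors`, and `has_token` are hypothetical, each bitvector is held in a Python integer used as a bitset, bins are assumed to start every `bin_size` bases and to extend `overlap` bases into the next bin, and tokens containing characters other than A, C, G, T are skipped.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def token_index(token: str) -> int:
    """Interpret a token as a base-4 number, so AAAAA -> 0 and TTTTT -> 4**5 - 1."""
    value = 0
    for base in token:
        value = value * 4 + CODE[base]
    return value


def build_bitvectors(reference: str, bin_size: int, overlap: int, n: int) -> list:
    """Return one existence bitvector (as an int bitset) per overlapping bin."""
    bitvectors = []
    for start in range(0, len(reference), bin_size):
        # Assumed geometry: each bin also covers `overlap` bases of the next bin.
        segment = reference[start:start + bin_size + overlap]
        bits = 0
        for i in range(len(segment) - n + 1):
            token = segment[i:i + n]
            if all(base in CODE for base in token):  # skip e.g. 'N' bases
                bits |= 1 << token_index(token)      # set the existence bit
        bitvectors.append(bits)
    return bitvectors


def has_token(bits: int, token: str) -> bool:
    """Read the existence bit of a token from a bin's bitvector."""
    return bool((bits >> token_index(token)) & 1)
```

Because the bitvectors depend only on the reference, `build_bitvectors` would run once per reference genome and its output would be stored alongside it.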
3.2 GRIM-Filter Operation
Before sequence alignment, GRIM-Filter checks each bin to see if the bin contains
a potential mapping location for the read, based on the list of potential locations
provided by the read mapper. If the bin contains a location, GRIM-Filter then
checks the bin to see if the location is likely to match the read sequence, by operating
on the bitvector of the bin. This relies on the entire read being contained within a
given bin, and thus requires the bins to overlap with each other in the construction
of the metadata (i.e., some base pairs are contained in multiple bins), as shown in
Figure 2a.
GRIM-Filter uses the described bitvectors to quickly determine whether a match
within a given error tolerance is impossible. This is done before running the expen-
sive sequence alignment algorithm, in order to reduce the number of unnecessary
sequence alignment operations. For each location associated with a seed, GRIM-
Filter 1) loads the bitvector of the bin containing the location; 2) operates on the
bitvector (as we will describe shortly) to quickly determine if there will be no match
(i.e., a poor match, given the error tolerance threshold); and 3) discards the location
if it determines a poor match. If GRIM-Filter does not discard the location, the
sequence at that location must be aligned with the read to determine the match
similarity.
Using the circled steps in Figure 3, we explain in detail how GRIM-Filter deter-
mines whether to discard a location z for a read sequence r. We use bin_num(z)
to indicate the number of the bin that contains location z. GRIM-Filter extracts
every token contained within the read sequence r (❶ in the figure). Then, GRIM-
Filter loads the bitvector for bin_{bin_num(z)} (❷). For each of the tokens contained
in r, GRIM-Filter extracts the existence bit of that token from the bitvector (❸),
to see whether the token exists somewhere within the bin. GRIM-Filter sums all
of the extracted existence bits together (❹), which we refer to as the accumulation
sum for location z (Sumz). The accumulation sum represents the number of tokens
from read sequence r that are present in bin_{bin_num(z)}. A larger accumulation sum
indicates that more tokens from r are present in the bin, and therefore the location
is more likely to contain a match for r. Finally, GRIM-Filter compares Sumz with a
constant accumulation sum threshold value (❺), to determine whether location z is
likely to match read sequence r. If Sumz is greater than or equal to the threshold,
then z is likely to match r, and the read mapper must perform sequence alignment
on r to the reference sequence at location z. If Sumz is less than the threshold, then
z will not match r, and the read mapper skips sequence alignment for the location.
We explain how we determine the accumulation sum threshold in Section 3.4.
[Figure 3 image: (1) tokens are extracted from the input read sequence r by sliding
a window over it; (2) the bitvector for bin_num(z) is loaded; (3) the existence bit
of each token is read from the bitvector; (4) the bits are added to form Sumz;
(5) Sumz is compared against the threshold. If Sumz ≥ threshold (YES), the location
is sent to the read mapper for sequence alignment; otherwise (NO), it is discarded.]
Figure 3: Flow diagram for our seed location filtering algorithm. GRIM-Filter takes in a read
sequence and sums the existence of its tokens within a bin to determine whether 1) the read
sequence must be sequence aligned to the reference sequence in the bin or 2) it can be discarded
without alignment. Note that token size = 5 in this example.
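The per-location check can be written compactly. In the sketch below, which is an illustration rather than the authors' implementation, a Python set of tokens stands in for the bin's bitvector (an assumption made to keep the example self-contained), and the names `tokens_of`, `accumulation_sum`, and `passes_filter` are hypothetical.

```python
def tokens_of(sequence: str, n: int) -> list:
    """All overlapping tokens of length n, in order of appearance."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def accumulation_sum(read: str, bin_tokens: set, n: int) -> int:
    """Count how many of the read's tokens exist somewhere in the bin (Sumz)."""
    return sum(1 for token in tokens_of(read, n) if token in bin_tokens)


def passes_filter(read: str, bin_tokens: set, n: int, threshold: int) -> bool:
    """True: send the location to alignment. False: discard it."""
    return accumulation_sum(read, bin_tokens, n) >= threshold
```

Note that repeated tokens in the read are counted once per occurrence, since each token position contributes its own existence bit to the sum.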
Once GRIM-Filter finishes checking each location, it returns control to the read
mapper, which performs sequence alignment on only those locations that pass the
filter. This process is repeated for all seed locations, and it significantly reduces the
number of alignment operations, ultimately reducing the end-to-end read mapping
runtime (as we show in Section 6). Our implementation of GRIM-Filter ensures a
zero false positive rate (i.e., no locations that result in correct mappings for the
read sequence are incorrectly rejected by the filter), as GRIM-Filter passes any
seed location whose bin contains enough of the same tokens as the read sequence.
GRIM-Filter can also account for errors in the sequence, when some of the tokens
do not match perfectly (see Section 3.5). Therefore, using GRIM-Filter to filter out
seed locations does not affect the correctness of the read mapper.
3.3 Integration with a Full Read Mapper
Figure 4 shows how we integrate GRIM-Filter with a read mapper to improve read
mapping performance. Before the read mapper begins sequence alignment, it sends
the read sequence, along with all potential seed locations found in the hash table for
the sequence, to GRIM-Filter. Then, the Filter Bitmask Generator for GRIM-Filter
performs the seed location filtering algorithm we describe in Section 3.2, checking
only the bins that include a potential seed location to see if the bin contains the
same tokens as the read sequence (❶ in Figure 4). For each location, we save the
output of our threshold decision (the computation of which was shown in Figure 3)
as a bit within a seed location filter bitmask, where a 1 means that the location’s
accumulation sum was greater than or equal to the threshold, and a 0 means that
the accumulation sum was less than the threshold. This bitmask is then passed
to the Seed Location Checker (❷ in Figure 4), which locates the reference segment
corresponding to each seed location that passed the filter (❸) and sends the reference
segment to the read mapper. The read mapper then performs sequence alignment
on only the reference segments it receives from the seed location checker (❹), and
outputs the correct mappings for the read sequence.
[Figure 4 image: the read sequence and all potential seed locations (e.g., 020128,
020131, 414415) are the inputs to GRIM-Filter. (1) The Filter Bitmask Generator
(see Figure 3) produces the Seed Location Filter Bitmask; (2) the Seed Location
Checker marks each location as KEEP or DISCARD; (3) reference segments for kept
locations (e.g., @ 020131, @ 414415) are fetched from Reference Segment Storage;
(4) the read mapper performs sequence alignment (edit-distance calculation) and
outputs the correct mappings.]
Figure 4: GRIM-Filter integration with a read mapper. The Filter Bitmask Generator uses
the bitvectors for each bin to determine whether any locations within the bin are potential
matches with the read sequence, and saves potential match information into a Seed Location
Filter Bitmask. The Seed Location Checker uses the bitmask to retrieve the corresponding
reference segments for only those seed locations that match, which are then sent to the read
mapper for sequence alignment.
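The hand-off between the two GRIM-Filter components can be sketched as follows. This is a simplified illustration: it assumes the accumulation sum of each candidate location has already been computed as described in Section 3.2, and the function names `filter_bitmask` and `seed_location_checker` are hypothetical stand-ins for the Filter Bitmask Generator output stage and the Seed Location Checker.

```python
def filter_bitmask(accumulation_sums: list, threshold: int) -> list:
    """One bit per candidate location: 1 = keep for alignment, 0 = discard."""
    return [1 if s >= threshold else 0 for s in accumulation_sums]


def seed_location_checker(reference: str, read_length: int,
                          locations: list, bitmask: list) -> list:
    """Fetch the reference segment only for locations whose bit is set."""
    return [(loc, reference[loc:loc + read_length])
            for loc, keep in zip(locations, bitmask) if keep]
```

Only the pairs returned by `seed_location_checker` would be forwarded to the read mapper, so the number of alignments is bounded by the number of 1 bits in the bitmask.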
3.4 Determining the Accumulation Sum Threshold
We now discuss in detail how to determine the threshold used to evaluate the
accumulation sum (Sumz). The threshold is used to determine whether or not a
seed location should be sent to the read mapper for sequence alignment (shown as
❺ in Figure 3). A greater value of Sumz indicates that the seed location z is more
likely to be a good match for the read sequence r. However, there are cases where
Sumz is high, but the read sequence results in a poor match with the seed location
z. A simple example of this poor match is a read sequence that consists entirely of
“A” base pairs, resulting in 100 AAAAA tokens, and a seed location that consists
entirely of “G” base pairs except for a single AAAAA token. In this example, all
100 AAAAA tokens in the read sequence locate the one AAAAA token in the seed
location, resulting in an accumulation sum of 100, even though the location contains
only one AAAAA token. Because such cases occur, even though they may occur
with low probability, GRIM-Filter cannot guarantee that a high accumulation sum
for a seed location corresponds to a good match with a read sequence. On the other
hand, GRIM-Filter can guarantee that a low accumulation sum (i.e., a sum that
falls under the threshold) indicates that any reference sequence within the bin is a
poor match with the read sequence. This is because a lower sum means that fewer
tokens from the read sequence are present in the bin, which translates directly to
a greater number of errors in a potential match. For a low enough sum, we can
guarantee that the potential read sequence alignment would have too many errors
to be a good match.
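The all-"A" example above can be reproduced directly. The sketch below is illustrative only: the read length of 104 (which yields 100 tokens for n = 5) and the layout of the bin (50 "G" bases on either side of one AAAAA token) are assumptions chosen to match the description, and `accumulation_sum` is a hypothetical helper that uses a set of tokens in place of the bitvector.

```python
def accumulation_sum(read: str, bin_sequence: str, n: int) -> int:
    """Sum the existence bits of the read's tokens within one bin."""
    bin_tokens = {bin_sequence[i:i + n] for i in range(len(bin_sequence) - n + 1)}
    return sum(read[i:i + n] in bin_tokens for i in range(len(read) - n + 1))


n = 5
read = "A" * 104                              # 104 - (5 - 1) = 100 AAAAA tokens
bin_sequence = "G" * 50 + "AAAAA" + "G" * 50  # contains a single AAAAA token

print(accumulation_sum(read, bin_sequence, n))  # 100
```

The sum reaches its maximum even though the read and the bin differ almost everywhere, which is why a high accumulation sum can only nominate a location for alignment and cannot confirm a match.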
3.5 Taking Errors into Account
If a read maps perfectly to a reference sequence in bin_{bin_num(z)}, Sumz would simply
be the total number of tokens in a read, which is read_length − (n − 1) for a token
size of n. However, to account for insertions, deletions, and substitutions in the
read sequence, sequence alignment has some error tolerance, where a read sequence
and a reference sequence are considered a good match even if some differences
exist. The accumulation sum threshold must account for this error tolerance, so we
reduce the threshold below read_length − (n − 1) to allow some tokens to include
errors. Figure 5a shows the equation that we use to calculate the threshold while
accounting for errors. As shown in Figure 5b, a token of size n in a bin overlaps with
n−1 other tokens. We calculate the lowest Sumz possible for a sequence alignment
that includes only a single error (i.e., one insertion, deletion, or substitution) by
studying these n tokens. If the error is an insertion, the insertion shifts at least
one of the n tokens to the right, preserving the shifted token while changing the
remaining tokens (n−1 in the worst case). If the error is a deletion or a substitution,
the change in the worst case can affect all n tokens. Figure 5b shows an example
of how a substitution affects four different tokens, where n = 4. Therefore, for each
error that we tolerate, we must assume the worst-case error (i.e., a deletion or a
substitution), in which case up to n tokens will not match with the read sequence
even when the location actually contains the read sequence.
(a)
\[
\text{Threshold} =
\underbrace{\bigl(\text{read\_length} - (n - 1)\bigr)}_{\text{total number of tokens in a read}}
\;-\;
\underbrace{n \times \underbrace{\lceil \text{read\_length} \times e \rceil}_{\text{number of errors allowed per read}}}_{\text{maximum number of tokens that could contain errors}}
\]
(b) [image: a single substitution error affects four tokens when n = 4]
Figure 5: (a) Equation to calculate the accumulation sum threshold for a read sequence,
where n is the token length and e is the sequence alignment error tolerance. (b) Impact of a
substitution error on four separate tokens, when n = 4. A single deletion or substitution error
propagates to 4 consecutive tokens, while a single insertion error propagates to 3 consecutive
tokens.
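The effect of a single substitution on the tokens can be checked with a small experiment. The sketch below is an illustration under assumptions: `missing_tokens` is a hypothetical helper, the 12-base sequence is taken from Figure 1, and the substituted positions are chosen arbitrarily.

```python
def missing_tokens(read: str, bin_sequence: str, n: int) -> int:
    """Number of the read's tokens that do not occur anywhere in the bin."""
    bin_tokens = {bin_sequence[i:i + n] for i in range(len(bin_sequence) - n + 1)}
    return sum(read[i:i + n] not in bin_tokens for i in range(len(read) - n + 1))


reference = "GAACTTGGAGTC"
middle_sub = "GAACTAGGAGTC"  # T -> A at index 5, covered by n = 4 tokens
edge_sub = "TAACTTGGAGTC"    # G -> T at index 0, covered by only one token

print(missing_tokens(middle_sub, reference, 4))  # 4
print(missing_tokens(edge_sub, reference, 4))    # 1
```

A substitution in the interior of the read removes n tokens from the accumulation sum, while one at the very edge removes fewer, so charging n tokens per error is the conservative (worst-case) choice.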
The equation in Figure 5a gives the accumulation sum threshold, accounting for
the worst-case scenario for a sequence alignment error tolerance of e. This means
that the maximum number of allowable errors is the ceiling of the product of the
read length and the error tolerance, ⌈read_length × e⌉. A sequence alignment
error tolerance of e = 0.05 or less is widely used [5, 25, 39, 100]. For each allowable
error, we assume that the worst-case number of tokens (equal to the token length
n) are affected by the error. We also assume the worst case that each error affects a
different set of tokens within the read, which results in the greatest possible number
of tokens that may not match. We calculate this by multiplying the maximum
number of allowable errors by n in the equation. Finally, we subtract the largest
possible number of tokens that may not match from the total number of tokens in
the read sequence, which is read_length − (n − 1). This leads to the threshold
value that GRIM-Filter uses to determine the seed locations that the read mapper
should perform sequence alignment on, as discussed in Section 3.2 and shown as ❺
in Figure 3.
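The threshold equation of Figure 5a translates into a short function. This is a sketch, with the function name `accumulation_sum_threshold` chosen for illustration; the rounding before the ceiling is an added guard against floating-point noise in `read_length * e` and is not part of the equation itself.

```python
import math


def accumulation_sum_threshold(read_length: int, n: int, e: float) -> int:
    """Threshold = (read_length - (n - 1)) - n * ceil(read_length * e)."""
    total_tokens = read_length - (n - 1)
    # Round first so that products such as 4.000000000000001 do not
    # become 5 after the ceiling.
    max_errors = math.ceil(round(read_length * e, 9))
    return total_tokens - n * max_errors
```

For a 100-base read with n = 5 and e = 0.05, this evaluates to 96 − 5 × 5 = 71, so a location passes the filter only if at least 71 of the read's 96 tokens exist in its bin. With e = 0, the threshold equals the total token count, and for large e or n it can drop to zero or below, in which case every location passes.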
3.6 Candidacy for 3D-Stacked Memory Implementations
We identify three characteristics of the filter bitmask generator in GRIM-Filter
that make it a strong candidate for implementation in 3D-stacked memory: 1) it
requires only very simple operations (e.g., sums and comparisons); 2) it is highly
parallelizable, since each bin can be operated on independently and in parallel; and
3) it is highly memory-bound, requiring a single memory access for approximately
every three computational instructions (we determine this by profiling a software
implementation of GRIM-Filter, i.e., GRIM-Software, which is described in Sec-
tion 5). Next, we describe how we implement GRIM-Filter in 3D-stacked memory.
4 Mapping GRIM-Filter to 3D-Stacked Memory
In this section, we first describe the 3D-stacked DRAM technology (Section 4.1),
which attempts to bridge the well-known disparity between processor speed and
memory bandwidth. Next, we describe how GRIM-Filter can be easily mapped to
utilize this new memory technology (Section 4.2). As the disparity between pro-
cessor speed and memory bandwidth increases, memory becomes more of a bottle-
neck in the computing stack in terms of both performance and energy consump-
tion [6, 45, 73, 77, 78]. Along with 3D-stacked DRAM, which enables much higher
bandwidth and lower latency compared to conventional DRAM, the disparity be-
tween processor and memory is alleviated by the re-emergence of the concept of
Processing-in-Memory (PIM). PIM integrates processing units inside or near the
main memory to 1) leverage high in/near-DRAM bandwidth, and low intra-DRAM
latency; and 2) reduce energy consumption by reducing the amount of data trans-
ferred to and from the processor. In this section, we briefly explain the required
background for these two technologies, which we leverage to implement GRIM-
Filter in a highly-parallel manner.
4.1 3D-Stacked Memory
Main memory is implemented using the DRAM (dynamic random access mem-
ory) technology in today’s systems [51, 52, 74]. Conventional DRAM chips are
connected to the processors using long, slow, and energy-hungry PCB (printed cir-
cuit board) interconnects [51–53, 56, 58, 63, 88]. The conventional DRAM chips do
not incorporate logic to perform computation. For more detail on modern DRAM
operation and architecture, we refer the reader to our previous works (e.g., [20, 22–
24, 36, 37, 50, 52–58, 60, 63, 64, 83]).
3D-stacked DRAM is a new DRAM technology that has a much higher internal
bandwidth than conventional DRAM, thanks to the closer integration of logic and
memory using the through-silicon via (TSV) interconnects, as seen in Figure 6. TSVs
are new, vertical interconnects that can pass through the silicon wafers of a 3D stack
of dies [47, 59, 67]. A TSV has a much smaller feature size than a traditional PCB
interconnect, which enables a 3D-stacked DRAM to integrate hundreds to thousands
of these wired connections between stacked layers. Using this large number of wired
connections, 3D-stacked DRAM can transfer bulk data simultaneously, enabling
much higher bandwidth compared to conventional DRAM. Figure 6 shows a 3D-
stacked DRAM (e.g., High Bandwidth Memory [2, 46]) based system that consists
of four layers of DRAM dies and a logic die stacked together and connected using
TSVs, a processor die, and a silicon interposer that connects the stacked DRAM and
the processor. The vertical connections in the stacked DRAM are very wide and very
short, which results in high bandwidth and low power consumption, respectively [59].
There are many different 3D-stacked DRAM architectures available today. High
Bandwidth Memory (HBM) is already integrated into the AMD Radeon R9 Series
graphics cards [4]. High Bandwidth Memory 2 (HBM2) is integrated in both the new
AMD Radeon RX Vega64 Series graphics cards [3] and the new NVIDIA Tesla P100
GPU accelerators [79]. Hybrid Memory Cube (HMC) is developed by a number of
different contributing companies [12, 44]. Like HBM, HMC also enables a logic layer
underneath the DRAM layers that can perform computation [6, 42, 43]. HMC is
already integrated in the SPARC64 XIfx chip [101]. Other new technologies that
can enable processing-in-memory are also already prototyped in real chips, such as
Micron’s Automata Processor [29] and Tibco transactional application servers [71,
94].
[Figure 6 image: a 3D-stacked DRAM (HBM DRAM dies stacked on a logic die,
connected by TSVs and microbumps) and a processor (GPU/CPU/SoC) die, each
connected through a PHY to an interposer that sits on the package substrate.]
Figure 6: 3D-stacked DRAM example. High Bandwidth Memory consists of stacked memory
layers (four layers in the picture) and a logic layer connected by high-bandwidth
through-silicon vias (TSVs) and microbumps [2, 46, 59]. The 3D-stacked memory is then connected to
a processor die with an interposer layer that provides high bandwidth between the logic layer
and the processing units on the package substrate.
Processing-in-Memory (PIM). A key technique to improve performance (both
bandwidth and latency) and reduce energy consumption in the memory system
is to place computation units inside the memory system, where the data resides.
Today, we see processing capabilities appearing inside and near DRAM memory
(e.g., in the logic layer of 3D-stacked memory) [6, 7, 17, 19, 21, 30–32, 38, 42, 43,
59, 66, 72, 84, 88–93, 102]. This computation inside or near DRAM significantly
reduces the need to transfer data to/from the processor over the memory bus. PIM
provides significant performance improvement and energy reduction compared to
the conventional system architecture [6–8, 19, 33, 42, 43, 92], which must transfer
all data to/from the processor since the processor is the only entity that performs
all computational tasks.
3D-Stacked DRAM with PIM. The combination of the two new technolo-
gies, 3D-stacked DRAM and PIM, enables very promising opportunities to build
very high-performance and low-power systems. A promising design for 3D-stacked
DRAM consists of multiple stacked memory layers and a tightly-integrated logic
layer that controls the stacked memory, as shown in Figure 6. As many prior works
show [6–8, 19, 33, 42, 43, 59, 66, 67, 103, 104], the logic layer in 3D-stacked DRAM
can be utilized not only for managing the stacked memory layers, but also for inte-
grating application-specific accelerators or simple processing cores. Since the logic
layer already exists and has enough space to integrate computation units, integrat-
ing application-specific accelerators in the logic layer requires modest design and
implementation overhead, and little to no hardware overhead (see [42, 102] for var-
ious analyses). Importantly, the 3D-stacked DRAM architecture enables us to fully
customize the logic layer for the acceleration of applications using processing-in-
memory (i.e., processing in the logic layer) [6, 7, 42, 103].
4.2 Mapping GRIM-Filter to 3D-Stacked Memory with PIM
We find that GRIM-Filter is a very good candidate to implement using processing-
in-memory, as the filter is memory-intensive and performs simple computational
operations (e.g., simple comparisons and additions). Figure 7 shows how we imple-
ment GRIM-Filter in a 3D-stacked memory. The center block shows each layer of an
example 3D-stacked memory architecture, where multiple DRAM layers are stacked
above a logic layer. The layers are connected together with several hundred TSVs,
which enable a high data transfer bandwidth between the layers. Each DRAM layer
is subdivided into multiple banks of memory. A bank in one DRAM layer is con-
nected to banks in the other DRAM layers using the TSVs. These interconnected
banks, along with a slice of the logic layer, are grouped together into a vault. Inside
the 3D-stacked memory, we store the bitvector of each bin (see Section 3) within a
bank as follows: 1) each bit of the bitvector is placed in a different row in a consec-
utive manner (e.g., bit 0 is placed in row 0, bit 1 in row 1, and so on); and 2) all
bits of the bitvector are placed in the same column, and the entire bitvector fits in
the column (e.g., bitvector 0 is placed in column 0, bitvector 1 in column 1, and
so on). We design and place customized logic to perform the GRIM-Filter opera-
tions within each logic layer slice, such that each vault can perform independent
GRIM-Filter operations in parallel with every other vault. Next, we discuss how we
organize the bitvectors within each bank. Afterwards, we discuss the customized
logic required for GRIM-Filter and the associated hardware cost.
[Figure 7 diagram labels. Left block (Bank): bitvector for bin 0, bitvector for bin 1, bitvector for bin 2, ..., bitvector for bin t–1; Row 0: AAAAA, Row 1: AAAAC, Row 2: AAAAG, ..., Row R–1: TTTTT; row buffer. Center block: DRAM layers, logic layer, TSVs, bank, vault. Right block (per-vault custom GRIM-Filter logic): row data register, per-bin logic modules (incrementer, accumulator, comparator), seed location filter bitmask.]
Figure 7: Left block: GRIM-Filter bitvector layout within a DRAM bank. Center block: 3D-
stacked DRAM with tightly integrated logic layer stacked underneath with TSVs for a high
intra-DRAM data transfer bandwidth. Right block: Custom GRIM-Filter logic placed in the
logic layer, for each vault.
The left block in Figure 7 shows the layout of bitvectors in a single bank. The
bitvectors are written in column order (i.e., column-major order) to the banks, such
that a DRAM access to a row fetches the existence bits of the same token across
many bitvectors (e.g., bitvectors 0 to t − 1 in the example in Figure 7). When
GRIM-Filter reads a row of data from a bank, the DRAM buffers the row within
the bank’s row buffer [53, 56, 75, 76], which resides in the same DRAM layer as
the bank. This data is then copied into a row data register that sits in the logic
layer, from which the GRIM-Filter logic can read the data. This data organization
allows each vault to compute the accumulation sum of multiple bins (e.g., bins 0 to
t−1 in the example) simultaneously. Thus, GRIM-Filter can quickly and efficiently
determine, across many bins,[1] whether a seed location needs to be discarded or
sequence aligned in any of these bins.
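The layout described above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware implementation: the helper names `token_to_row` and `build_bank`, the toy bin sequences, and the token size of 2 are assumptions made for the example. The row ordering (tokens read as base-4 numbers over A, C, G, T) is inferred from the row labels in Figure 7 (Row 0: AAAAA, Row 1: AAAAC, Row R–1: TTTTT).

```python
BASES = "ACGT"  # assumed lexicographic order, so that Row 0 is the all-A token


def token_to_row(token: str) -> int:
    """Map a token to its row index by reading it as a base-4 number."""
    row = 0
    for base in token:
        row = row * 4 + BASES.index(base)
    return row


def build_bank(bin_sequences: list[str], token_size: int) -> list[list[int]]:
    """Return bank[row][column]: 1 if the token of `row` occurs in bin `column`.

    Each column holds one bin's bitvector (column-major layout), so a single
    row holds the existence bits of one token across all bins in the bank.
    Sequences are assumed to contain only A, C, G, and T.
    """
    num_rows = 4 ** token_size
    bank = [[0] * len(bin_sequences) for _ in range(num_rows)]
    for column, sequence in enumerate(bin_sequences):
        for start in range(len(sequence) - token_size + 1):
            token = sequence[start:start + token_size]
            bank[token_to_row(token)][column] = 1
    return bank


# Three hypothetical toy bins with a token size of 2 (16 rows, 3 columns).
bank = build_bank(["ACGT", "AAAA", "GTAC"], token_size=2)

# One row access yields the existence bits of token "AC" for every bin.
row_for_ac = bank[token_to_row("AC")]
print(row_for_ac)
```

In this toy bank, reading the row for "AC" returns one bit per bin, which is the property that lets a vault update the sums of many bins with a single row access.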
The right block in Figure 7 shows the custom hardware logic implemented for
GRIM-Filter in each vault’s logic layer. We design a small logic module for GRIM-Filter,
which consists of only an incrementer, an accumulator, and a comparator, and
operates on the bitvector_x of a single bin x. The incrementer adds 1 to the value
in the accumulator, which stores the accumulated sum for bin x. In order to hold
the final sum (i.e., Sumz, shown as ❹ in Figure 3), each accumulator must be at
least ⌈log2(read length)⌉ bits wide. Each comparator must be of the same width
as the accumulator, as the comparator is used to check whether the accumulated
sum is greater than or equal to the accumulated sum threshold. Because of the way we arrange the
bitvectors in DRAM, a single read operation in a vault retrieves many (e.g., t)
existence bits in parallel, from many (e.g., t) bitvectors, for the same token. These
existence bits are copied from a DRAM bank’s row buffer into a row data register
within the logic layer slice of the vault. In order to maximize throughput, we add a
GRIM-Filter logic module for each bin to the logic layer slice. This allows GRIM-Filter
to process all of these existence bits[1] from multiple bitvectors in parallel.
[1]In other words, the number of bins that can be accessed in parallel from each bank, times the
number of banks times the number of vaults, within a DRAM chip.
Integration into the System and Low-Level Operation. When GRIM-Filter
starts in the CPU (spawned by a read mapper), it sends a read sequence r to the
in-memory GRIM-Filter logic, along with a range of consecutive bins to check for a
match. GRIM-Filter quickly checks the range of bins to determine whether or not to
discard seed locations within those bins. In the logic layer, the GRIM-Filter Filter
Bitmask Generator (see Section 3.3) iterates through each token in read sequence r.
For each token, GRIM-Filter reads the memory row in each vault that contains the
existence bits for that token, for the bins being checked, into the row buffer inside
the DRAM layer. Then, GRIM-Filter copies the row to the row data register in the
logic layer. Each GRIM-Filter logic module is assigned to a single bin. The logic
module examines the bin’s existence bit in the row data register, and the incrementer adds
one to the value in the accumulator only if the existence bit is set. This process is
repeated for all tokens in r. Once all of the tokens are processed, each logic module
uses its comparator to check if the accumulator, which now holds the accumulated
sum (Sumz, shown as ❹ in Figure 3) for its assigned bin, is greater than or equal to
the accumulated sum threshold. If Sumz is greater than or equal to the threshold,
a seed location filter bit is set, indicating that the read sequence should be sequence
aligned with the locations in the bin by the read mapper. To maintain the same
amount of parallelism present in the bitvector operations, we place the seed location
filter bits into a seed location filter bitmask, where each logic module writes to one
bit in the bitmask once it performs the accumulator sum threshold comparison.
The seed location filter bitmask is then written to the DRAM layer. Once the Seed
Location Checker (see Section 3.3) starts executing in the CPU, it reads the seed
location filter bitmasks from DRAM, and performs sequence alignment for only
those bins whose seed location filter bits are set to 1.
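The low-level operation can be modeled in software as follows. This is a minimal sketch under stated assumptions, not the in-memory implementation: the function name `seed_location_filter_bitmask`, the dictionary-based bank, the toy read, and the threshold value are all hypothetical, and we assume that tokens are taken at every position of the read (stride 1). The accumulated sum threshold itself is defined in Section 3 and is simply passed in here.

```python
def seed_location_filter_bitmask(read, bank_rows, num_bins, token_size, threshold):
    """Software model of the per-vault logic for one range of bins.

    bank_rows maps a token to its row of existence bits (one bit per bin);
    tokens absent from the dictionary are treated as all-zero rows.
    """
    accumulators = [0] * num_bins      # one accumulator per logic module
    empty_row = [0] * num_bins
    for start in range(len(read) - token_size + 1):
        token = read[start:start + token_size]
        # Row fetch: DRAM row buffer -> row data register in the logic layer.
        row_data_register = bank_rows.get(token, empty_row)
        # All logic modules process the same token in lockstep.
        for bin_index, existence_bit in enumerate(row_data_register):
            if existence_bit:          # incrementer fires only if the bit is set
                accumulators[bin_index] += 1
    # Comparator: a set bit means the bin's locations go to sequence alignment.
    return [1 if total >= threshold else 0 for total in accumulators]


# Hypothetical bank with two bins and a token size of 2.
bank_rows = {
    "AC": [1, 0],
    "CG": [1, 0],
    "GT": [1, 0],
    "TT": [0, 1],
}
bitmask = seed_location_filter_bitmask(
    "ACGT", bank_rows, num_bins=2, token_size=2, threshold=2
)
print(bitmask)
```

In this toy case, all three tokens of the read exist in the first bin and none exist in the second, so only the first bin's filter bit is set.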
Hardware Overhead. The hardware overhead of our GRIM-Filter implemen-
tation in 3D-stacked memory depends on the available bandwidth b between a
memory layer and the logic layer. In HBM2 [80], this bandwidth is 4096 bits per
cycle across all vaults (i.e., each clock cycle, 4096 bits from a memory layer can be
copied to the row data registers in the logic layer). GRIM-Filter exploits all of this
parallelism completely, as we can place b GRIM-Filter logic modules (4096 modules
for HBM2) across all vaults within the logic layer. In total, for an HBM2 memory,
and for a read mapper that processes reads consisting of 100 base pairs, GRIM-
Filter requires 4096 incrementer lookup tables (LUTs), 4096 seven-bit counters (a
seven-bit counter can hold the maximum accumulator sum for a 100-base-pair read
sequence), 4096 comparators, and enough buffer space to hold the seed location
filter bitmasks. With a larger bandwidth between the logic and memory layers, we
would be able to compute the seed location filter bits for more bins in parallel, but
this would also incur a larger hardware overhead in the logic layer.
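The resource counts above follow directly from the bandwidth b and the read length. The short sketch below reproduces that arithmetic; the function names and the dictionary keys are illustrative assumptions, while the width formula is the ⌈log2(read length)⌉ bound stated in Section 4.2.

```python
import math


def accumulator_width_bits(read_length: int) -> int:
    """Accumulator (and comparator) width: ceil(log2(read length)) bits."""
    return math.ceil(math.log2(read_length))


def logic_layer_resources(bits_per_cycle: int, read_length: int) -> dict:
    """One logic module per bit of memory-to-logic bandwidth."""
    return {
        "logic_modules": bits_per_cycle,
        "incrementers": bits_per_cycle,
        "comparators": bits_per_cycle,
        "counter_width_bits": accumulator_width_bits(read_length),
    }


# HBM2 bandwidth of 4096 bits per cycle and 100-base-pair reads, as in the text.
resources = logic_layer_resources(bits_per_cycle=4096, read_length=100)
print(resources)
```

For 100-base-pair reads this gives seven-bit counters, which can represent values up to 127.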
While the read mapper performs sequence alignment on seed locations specified by
one seed location filter bitmask, GRIM-Filter generates seed location filter bitmasks
for a different set of seed locations. We find that a bitmask buffer size of 512 KB
(stored in DRAM) provides enough capacity to ensure that GRIM-Filter and the
read mapper never stall due to a lack of buffer space.
The overall memory footprint (i.e., the amount of storage space required) of the
bitvectors for a reference genome is calculated by multiplying the number of bins
by the size of a single bitvector. In Section 6.1, we show how we find a set of parameters
that results in an effective filter with a low memory footprint (3.8 GB).
We conclude that GRIM-Filter requires a modest and simple logic layer, which
gives it an advantage over other seed location filtering algorithms that could be
implemented in the logic layer.
5 Experimental Methodology
Evaluated Read Mappers. We evaluate our proposal by incorporating GRIM-
Filter into the state-of-the-art hash table based read mapper, mrFAST with
FastHASH [98]. We choose this mapper for our evaluations as it provides high accu-
racy in the presence of relatively many errors, which is required to detect genomic
variants within and across species [9, 98]. GRIM-Filter plugs in as an extension
to mrFAST, using a simple series of calls to an application programming interface
(API). However, we note that GRIM-Filter can be used with any other read mapper.
We evaluate two read mappers:
• mrFAST with FastHASH [98], which does not use GRIM-Filter;
• GRIM-3D, our 3D-stacked memory implementation of GRIM-Filter combined
with mrFAST and the non-filtering portions of FastHASH.
Major Evaluation Metrics. We report 1) GRIM-Filter’s false negative rate
(i.e., the fraction of locations that pass through the filter but do not contain a
match with the read sequence), and 2) the end-to-end performance improvement of
the read mapper when using GRIM-Filter. We measure the false negative rate of
our filter (and the baseline filter used by the mapper) as the ratio of the number of
locations that passed the filter but did not result in a mapping over all locations that
passed the filter. Note that our implementation of GRIM-Filter ensures a zero false
positive rate (i.e., it does not filter out any correct mappings for the read sequence),
and, thus, GRIM-Filter does not affect the correctness of a read mapper.
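As a concrete reading of this metric, the sketch below computes the false negative rate exactly as defined in this section: the number of locations that passed the filter but did not result in a mapping, divided by all locations that passed the filter. The function name and the example counts are hypothetical.

```python
def false_negative_rate(locations_passed: int, locations_mapped: int) -> float:
    """Fraction of filter-passing locations that did not yield a mapping.

    This follows the terminology used in this paper, where a location that
    passes the filter but fails alignment counts as a false negative.
    """
    if locations_passed == 0:
        return 0.0
    return (locations_passed - locations_mapped) / locations_passed


# Hypothetical counts: 1000 locations pass the filter, 250 of them map.
rate = false_negative_rate(locations_passed=1000, locations_mapped=250)
print(rate)
```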
Performance Evaluation. We measure the performance improvement of GRIM-
3D by comparing the execution time of our read mappers. We develop a method-
ology to estimate the performance of GRIM-3D, since real hardware systems that
enable in-memory computation are unavailable to us at this point in time. To es-
timate GRIM-3D’s execution time, we need to add up the time spent by three
components (which we denote as tx for component x):
• t1: the time spent on read mapping,
• t2: the time spent on coordinating which bins are examined by GRIM-Filter,
and
• t3: the time spent on applying the filter to each seed.
To obtain t1 and t2, we measure the performance of GRIM-Software, a software-only
version of GRIM-Filter that does not take advantage of processing in 3D-stacked
memory. We run GRIM-Software with mrFAST, and measure:
• GRIM-Software-End-to-End-Time, the end-to-end execution time for read
mapping using GRIM-Software;
• GRIM-Software-Filtering-Time, the time spent only on applying the filter
(i.e., the GRIM-Filter portions of the code shown in Figure 4) using GRIM-
Software.
The values of t1 and t2 are the same for GRIM-Software and GRIM-3D, and
we can compute those by subtracting out the time spent on filtering from
the end-to-end execution time: t1 + t2 = GRIM-Software-End-to-End-Time −
GRIM-Software-Filtering-Time. To estimate t3, we use a validated simulator sim-
ilar to Ramulator [52, 86], which provides us with the time spent by GRIM-3D on
filtering using processing-in-memory. The simulator models the time spent by the
in-memory logic to produce a seed location filter bitmask, and to store the bitmask
into a buffer that is accessible by the read mapper.
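The estimation procedure amounts to a small calculation, sketched below. The function name, argument names, and timing values are hypothetical placeholders; only the formula (t1 + t2 taken from the GRIM-Software measurements, t3 taken from the simulator) comes from the methodology described above.

```python
def estimate_grim_3d_time(
    software_end_to_end_s: float,
    software_filtering_s: float,
    simulated_pim_filtering_s: float,
) -> float:
    """Estimate GRIM-3D execution time as (t1 + t2) + t3."""
    # t1 + t2: read mapping plus bin coordination, identical for
    # GRIM-Software and GRIM-3D.
    t1_plus_t2 = software_end_to_end_s - software_filtering_s
    # t3: in-memory filtering time reported by the simulator.
    return t1_plus_t2 + simulated_pim_filtering_s


# Hypothetical timings in seconds, for illustration only.
estimate = estimate_grim_3d_time(100.0, 40.0, 2.0)
print(estimate)
```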
Evaluation System. We evaluate the software versions of the read mappers (i.e.,
mrFAST with FastHASH and GRIM-Software) using an Intel(R) Core i7-2600 CPU
running at 3.40GHz [27], with 16GB of DRAM for all experiments.
Data Sets. We used ten real data sets from the 1000 Genomes Project [1]. We
used the same data sets used by Xin et al. [98] for the original evaluation of mrFAST
with FastHASH, in order to provide a fair comparison to our baseline. Table 1 lists
the read length and size of each data set.
Data Set       No. of Reads   Read Length
ERR240726_1    4031354        100
ERR240727_1    4082203        100
ERR240728_1    3894290        100
ERR240729_1    4013341        100
ERR240730_1    4082472        100
ERR240726_2    4389429        100
ERR240727_2    4013341        100
ERR240728_2    4013341        100
ERR240729_2    4082472        100
ERR240730_2    4082472        100
Table 1: Benchmark data, obtained from the 1000 Genomes Project [1]
Code Availability. The code for GRIM-Filter, GRIM-Software, and our simu-
lator for 3D-stacked DRAM with processing-in-memory is freely available at
https://github.com/CMU-SAFARI/GRIM.
6 Evaluation Results
We first profile the reference human genome in order to determine a range of
parameters that are reasonable to use for GRIM-Filter. We determine the points
of diminishing returns for several parameter values. This data is presented in Sec-
tion 6.1. Using this preliminary data, we reduce the required experiments to a
reasonable range of parameters. Our implementation of GRIM-Filter enables the
variation of runtime parameters (number of bins, token size, error tolerance, etc.)
within the ranges of values that we determine from our experimentation for the best
possible results. We then quantitatively evaluate GRIM-Filter’s improvement in
false negative rate and mapper runtime over the baseline mrFAST with FastHASH
(Section 6.2).
6.1 Sensitivity to GRIM-Filter Parameters
In order to determine a range for the parameters for our experiments, we ran a series
of analyses on the fundamental characteristics of the human reference genome. We
perform these initial experiments to 1) determine effective parameters for GRIM-Filter
and 2) compute its memory footprint. The memory footprint of GRIM-Filter
depends directly on the number of bins that we divide the reference genome into,
since each bin requires a bitvector to hold the token existence bits. Since the bitvector
must contain a Boolean entry for each possible token of size n, each
bitvector must contain 4^n bits. The total memory footprint is then obtained by
multiplying the bitvector size by the number of bins. In this section, we sweep the
number of bins, token size, and error tolerance of GRIM-Filter while considering the
memory footprint. To understand how each of the different parameters affect the
performance of GRIM-Filter, we study a sweep on the parameters with a range of
values that result in a memory footprint under 16 GB (which is the current capacity
of HBM2 on state-of-the-art devices [79]).
Average Read Existence. Figure 8 shows how varying a number of different
parameters affects the average read existence across the bins. We define average
read existence to be the ratio of bins with seed locations that pass the filter over
all bins comprising the genome, for a representative set of reads. We would like this
value to be as low as possible because it reflects the filter’s ability to filter incorrect
mappings. A lower average read existence means that fewer bins must be checked
when mapping the representative set of reads. Across the three plots, we vary the
token size from 4 to 6. Within each plot, we vary the number of bins to split the
reference genome into, denoted by the different curves (with different colors and
markers). The X-axis shows the error tolerance that is used, and the Y-axis shows
average read existence. We plot the average and min/max across our 10 data sets
(Table 1) as indicated, respectively, by the triangle and whiskers.
Figure 8: Effect of varying token size, error tolerance, and bin count on average read existence.
We use a representative set of reads to collect this data. A lower value of average read existence
represents a more effective filter. Note that the scale of the Y-axis is different for the three
different graphs.
We make three observations from the figure. First, looking across the three plots,
we observe that increasing the token size from 4 to 5 provides a large (i.e., around
10x) reduction in average read existence, while increasing the token size from 5 to 6
provides a much smaller (i.e., around 2x) reduction in average read existence. The
reduction in average read existence is due to the fact that, in a random pool of As,
Cs, Ts, and Gs, the probability of observing a certain substring of size q is (1/4)^q.
Because the distribution of base pairs across a reference genome and across a bin is
not random, a larger token size does not always result in a large decrease, as seen
when changing the token size from 5 to 6. We note that increasing the token size by
one causes GRIM-Filter to use 4x the memory footprint. Second, we observe that
in all three plots (i.e., for all token sizes), an increase in the number of bins results
in a decrease in the average read existence. This is because the bin size decreases
as the number of bins increases, and for smaller bins, we have a smaller sample
size of the reference genome that any given substring could exist within.[2] Third,
we observe that for each plot, increasing the error tolerance results in an increase
in the average read existence. This is due to the fact that if we allow more errors,
fewer tokens of the entire read sequence must be present in a bin for a seed location
from that bin to pass the filter. This increases the probability that a seed location
of a random read passes the filter for a random bin. A poor sequence alignment at
a location that passes the filter is categorized as a false negative. We conclude from
this figure that using tokens of size 5 provides quite good filtering effectiveness (as
measured by average read existence) without requiring as much memory footprint
as using a token size of 6.
False Negative Rate. We choose our final bitvector size after sweeping the num-
ber of bins and the error tolerance (e). Figure 9 shows how varying these parameters
affects the false negative rate of GRIM-Filter. The X-axis varies the number of bins,
while the different lines represent different values of e.
[2]When sweeping the number of bins, we use multiples of 2^16 because 2^16 is an even multiple of
the number of TSVs between the logic and memory layers in today’s 3D-stacked memories (today’s
systems typically have 4096 TSVs). We want to use a multiple of 2^16 so that we can utilize all
TSVs each time we copy data from a row buffer in the memory layer to the corresponding row data
register in the logic layer. This maximizes GRIM-Filter’s internal memory bandwidth utilization
within 3D-stacked memory.
Figure 9: GRIM-Filter’s false negative rate (lower is better) as we vary the number of bins.
We find that increasing the number of bins beyond 300×2^16 yields diminishing improvements
in the false negative rate, regardless of the error tolerance value.
We make two observations from this figure. First, we find that, with more bins (i.e.,
with a smaller bin size), the false negative rate (i.e., the fraction of locations that
pass the filter, but do not result in a mapping after alignment) decays exponentially.
Above 300×2^16 bins, we begin to see diminishing returns on the reduction in false
negatives for all error tolerance values. Second, we observe that, as we increase the
error tolerance, regardless of the other parameters, the false negative rate increases.
We also find that the number of bins 1) minimally affects the runtime of GRIM-
Filter (not plotted) and 2) linearly increases the memory footprint. Based on this
study, we choose to use 450×2^16 bins, which reflects a reasonable memory footprint
(see below) with the other parameters.
Memory Footprint. A larger number of bins results in more bitvectors, so we
must keep this parameter at a reasonable value in order to retain a reasonable mem-
ory footprint for GRIM-Filter. Since we have chosen a token size of 5, GRIM-Filter
requires t bitvectors with a length of 4^5 = 1024, where t equals the number of bins
we segment the reference genome into. We conclude that employing 450×2^16 bins
results in the best trade-off between memory footprint, filtering efficiency, and run-
time.[3] This set of parameters results in a total memory footprint of approximately
3.8 GB for storing the bitvectors of this mechanism, which is a very reasonable size
for today’s 3D-stacked memories [2–4, 12, 44, 46, 79].
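The footprint figure can be checked with a short calculation: each bitvector holds one bit per possible token (4^n bits for token size n), and there is one bitvector per bin. The function name below is an assumption made for the example; the parameter values (token size 5 and 450×2^16 bins) are the ones chosen in this section.

```python
def bitvector_footprint_bytes(num_bins: int, token_size: int) -> int:
    """Total bitvector storage: one bitvector of 4**token_size bits per bin."""
    bits_per_bitvector = 4 ** token_size
    return num_bins * bits_per_bitvector // 8


chosen = bitvector_footprint_bytes(num_bins=450 * 2 ** 16, token_size=5)
print(chosen)              # footprint in bytes
print(chosen / 1e9)        # footprint in GB (10^9 bytes), roughly 3.8

# Increasing the token size by one multiplies the footprint by four.
larger = bitvector_footprint_bytes(num_bins=450 * 2 ** 16, token_size=6)
print(larger // chosen)
```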
GRIM-Filter Parallelization. GRIM-Filter operates on every bin indepen-
dently and in parallel, using a separate logic module for each bin. Thus, GRIM-
Filter’s parallelism increases with each additional bin it operates on simultaneously.
We refer to the set of consecutive bins that the GRIM-Filter logic modules are cur-
rently assigned to as the bin window (w). The internal bandwidth of HBM2 [80]
enables copying 4096 bits from a memory layer to the logic layer every cycle, allow-
ing GRIM-Filter to operate on as many as 4096 consecutive bins in parallel (i.e.,
it has a bin window of size w = 4096). GRIM-Filter must only check bin windows
that contain at least one seed location (i.e., a span of 4096 consecutive bins with
zero seed locations does not need to be checked). In contrast, if a consecutive set of
4096 bins contains many seed locations, GRIM-Filter can operate on every bin in
parallel and quickly determine which seed locations within the 4096 bins can safely
be discarded. In these cases, GRIM-Filter can most effectively utilize the parallelism
available from the 4096 independent logic modules.
[3]We note that the time to generate the bitvectors is not included in our final runtime results,
because these need to be generated only once per reference genome, either by the user or by
the distributor. We find that, with a genome of length L, we can generate the bitvectors in
(9.03 × 10^-8) × L seconds when we use 450×2^16 bins (this is approximately 5 minutes for the
human genome).
In order to understand GRIM-Filter’s ability to parallelize operations on many
bins, we analyze GRIM-Filter when using a bin window of size w = 4096, which
takes advantage of the full memory bandwidth available in HBM2 memory. As we
discuss in Section 3.3, the read mapper generates a list of potential seed locations
for a read sequence, and sends this list to GRIM-Filter when the filter starts. Several
bins, which we call empty bins, do not contain any potential seed locations. When
w = 1, there is only one logic module, and if the module is assigned to an empty
bin, GRIM-Filter immediately moves on to the next bin without computing the
accumulation sum. However, when w = 4096, some, but not all, of the logic modules
may be assigned to empty bins. This happens because in order to simplify the
hardware, GRIM-Filter operates all of the logic modules in lockstep (i.e., the filter
fetches a single row from each bank of memory, which includes the existence bits for
a single token across multiple bins, and all of the logic modules read and process the
existence bits for the same token in the same cycle). Thus, a logic module assigned
to an empty bin must wait for the other logic modules to finish before it can move
onto another bin. As a result, GRIM-Filter with w = 4096 is not 4096x faster than
GRIM-Filter with w = 1. To quantify the benefits of parallelization, we compare the
performance of GRIM-Filter with these two bin window sizes using a representative
set of reads. For 10% of the seeds, we find that GRIM-Filter with w = 4096 reduces
the filtering time by 98.6%, compared to GRIM-Filter with w = 1. For the remaining
seeds, we find that GRIM-Filter with w = 4096 reduces the filtering time by 10–
20%. Thus, even though many of the logic modules are assigned to empty bins in a
given cycle, GRIM-Filter reduces the filtering time by operating on many bins that
contain potential seed locations in parallel.
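The bin-window bookkeeping can be illustrated with a small sketch. It assumes, for simplicity, that bin windows are aligned to multiples of w (the text only requires that a window be a set of consecutive bins), and the function name and seed bin indices are hypothetical. Windows that contain no seed location are skipped entirely, while windows with few occupied bins still cost a full lockstep pass.

```python
from collections import Counter


def occupied_bins_per_window(seed_bins, window_size=4096):
    """Count the bins that hold potential seed locations in each bin window.

    Windows are assumed to be aligned: window k covers bins
    [k * window_size, (k + 1) * window_size).
    """
    return Counter(bin_index // window_size for bin_index in set(seed_bins))


# Hypothetical bins containing potential seed locations for one read.
counts = occupied_bins_per_window([5, 4100, 4200, 20000])
windows_to_check = sorted(counts)  # windows without seed locations are skipped
print(windows_to_check)
print(dict(counts))
```

Here only three windows need to be checked, and each of them keeps most of its 4096 logic modules assigned to empty bins, which is why the speedup from w = 4096 is smaller than 4096x.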
Overlapping GRIM-Filter Computation with Sequence Alignment in
the CPU. In addition to operating on multiple bins in parallel, one benefit of
implementing GRIM-Filter in 3D-stacked memory is that filtering operations can
be parallelized with sequence alignment that happens on the CPU, since filtering
no longer uses the CPU. Every cycle, for a bin window of size w = 4096, GRIM-Filter’s
Filter Bitmask Generator (❶ in Figure 4) reads 4096 bits from memory, and
updates the accumulation sums for the bins within the bin window that contain a
potential seed location. Once the accumulation sums are computed and compared
against the threshold, GRIM-Filter’s Seed Location Checker (❷ in Figure 4) can
discard seed locations that map to bins whose accumulation sums do not meet the
threshold (i.e., the seed locations that should not be sent to sequence alignment).
The seed locations that are not discarded are sent to the read mapper for sequence
alignment (❹ in Figure 4), ending GRIM-Filter’s work for the current bin window.
While the read mapper aligns the sequences that passed through the filter from
the completed bin window, GRIM-Filter’s Filter Bitmask Generator moves onto
another bin window, computing the seed location filter bits for a new set of bins. If
GRIM-Filter can exploit enough parallelism, it can provide the CPU with enough
bins to keep the sequence alignment step busy for at least as long as the time needed
for the Filter Bitmask Generator to process the new bin window. This would allow
the filtering latency to overlap completely with alignment, in effect hiding GRIM-
Filter’s latency. We find that a bin window of 4096 bins provides enough parallelism
to completely hide the filtering latency while the read mapper running on the CPU
performs sequence alignment.
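The effect of overlapping filtering with alignment can be illustrated with a simplified two-stage pipeline model. This is a sketch under our own assumptions, not the simulator used in the evaluation: the function names and per-window times are hypothetical, and the model assumes that the filter never stalls for lack of bitmask buffer space.

```python
def overlapped_time(filter_times, align_times):
    """Finish time when window i+1 is filtered while window i is aligned."""
    filter_done = 0
    align_done = 0
    for filter_time, align_time in zip(filter_times, align_times):
        filter_done += filter_time  # in-memory filtering runs back to back
        # Alignment of a window starts once its bitmask is ready and the
        # CPU has finished aligning the previous window.
        align_done = max(align_done, filter_done) + align_time
    return align_done


def serial_time(filter_times, align_times):
    """Finish time when filtering and alignment never overlap."""
    return sum(filter_times) + sum(align_times)


# Hypothetical per-window times (arbitrary units): alignment dominates.
filter_times = [1, 1, 1]
align_times = [3, 3, 3]
print(serial_time(filter_times, align_times))
print(overlapped_time(filter_times, align_times))
```

When alignment of each window takes at least as long as filtering the next one, only the first window's filtering time remains visible in this model; the rest is hidden behind alignment.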
6.2 Full Mapper Results
We use a popular seed-and-extend mapper, mrFAST [9], to retrieve all candidate
mappings from the ten real data sets we evaluate (see Section 5). In our experiments,
we use a token size of 5 and 450×2^16 bins, as discussed in Section 6.1. All remaining
parameters specific to mrFAST are held at the default values across all of our
evaluated read mappers.
False Negative Rate. Figure 10 shows the false negative rate of GRIM-Filter
compared to the baseline FastHASH filter across the ten real data sets we evaluate.
The six plots in the figure show false negative rates for error tolerance values (i.e.,
e) ranging from 0.00 to 0.05, in increments of 0.01. We make three observations
from the figure. First, GRIM-Filter provides a much lower false negative rate than
the baseline FastHASH filter for all data sets and for all error tolerance values. For
an error tolerance of e = 0.05 (shown in the bottom graph),[4] the false negative
rate for GRIM-Filter is 5.97x lower than for FastHASH filter, averaged across all
10 read data sets. Second, GRIM-Filter’s false negative rate 1) increases as the error
tolerance increases from e = 0.00 to e = 0.02, and then 2) decreases as the error
tolerance increases further from e = 0.03 to e = 0.05. There are at least two conflict-
ing reasons. First, as the error tolerance increases, the accumulation sum threshold
decreases (as shown in Figure 5) and thus GRIM-Filter discards fewer locations,
which results in a higher false negative rate. Second, as the error tolerance increases,
the number of acceptable (i.e., correct) mapping locations increases while the num-
ber of candidate locations remains the same, which results in a lower false negative
rate. The interaction of these conflicting reasons results in the initial increase and
the subsequent decrease in the false negative rates that we observe. Third, we ob-
serve that for higher error tolerance values, GRIM-Filter reduces the false negative
rate compared to the FastHASH filter by a larger fraction. This shows that GRIM-
Filter is much more effective at filtering mapping locations when we increase the
error tolerance. We conclude that GRIM-Filter is very effective in reducing the false
negative rate.
Execution Time. Figure 11 compares the execution time of GRIM-3D to that
of mrFAST with FastHASH across all ten different read data sets for the same error
tolerance values used in Figure 10. We make three observations. First, GRIM-3D
improves performance for all of our data sets for all error tolerance values. For an
error tolerance of e = 0.05, the average (maximum) performance improvement is
2.08x (3.65x) across all 10 data sets. Second, as the error tolerance increases, GRIM-
3D’s performance improvement also increases. This is because GRIM-Filter safely
discards many more mapping locations than the FastHASH filter at higher error
tolerance values (as we showed in Figure 10). Thus, GRIM-Filter saves significantly
more execution time than the FastHASH filter by ignoring many more unnecessary
alignments. Third, based on an analysis of the execution time breakdown of GRIM-
3D (not shown), we find that GRIM-3D’s performance gains are mainly due to an
83.7% reduction in the average computation time spent on false negatives, compared
to using the FastHASH filter for seed location filtering. We conclude that employing
GRIM-Filter for seed location filtering in a state-of-the-art read mapper significantly
improves the performance of the read mapper.
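To make the link between the third observation and the end-to-end speedup concrete, the following is a rough Amdahl-style estimate. It is a sketch under stated assumptions, not the model used in our evaluation: `fn_fraction` (the share of baseline execution time spent aligning false negatives) and `filter_overhead` (extra filtering time, as a share of baseline time) are hypothetical parameters, and only the 83.7% reduction comes from the text above.

```python
def estimated_speedup(fn_fraction: float,
                      fn_reduction: float = 0.837,
                      filter_overhead: float = 0.0) -> float:
    """Estimate end-to-end speedup when time spent on false negatives shrinks.

    fn_fraction     -- hypothetical share of baseline time spent on false negatives
    fn_reduction    -- fractional reduction of that time (0.837 from the text)
    filter_overhead -- hypothetical extra filter time, as a share of baseline time
    """
    if not 0.0 <= fn_fraction <= 1.0:
        raise ValueError("fn_fraction must be in [0, 1]")
    if not 0.0 <= fn_reduction <= 1.0:
        raise ValueError("fn_reduction must be in [0, 1]")
    if filter_overhead < 0.0:
        raise ValueError("filter_overhead must be non-negative")
    # Normalized new execution time: the untouched part of the baseline,
    # plus what remains of the false-negative work, plus any filter overhead.
    remaining = 1.0 - fn_fraction * fn_reduction + filter_overhead
    if remaining <= 0.0:
        raise ValueError("parameters leave no remaining execution time")
    return 1.0 / remaining


# Hypothetical sweep: the larger the share of time wasted on false
# negatives in the baseline, the larger the benefit of removing it.
for share in (0.25, 0.50, 0.75):
    print(f"fn_fraction = {share:.2f}: estimated speedup = {estimated_speedup(share):.2f}x")
```

The estimate ignores second-order effects such as changes in memory behavior, so it should be read only as an illustration of why reducing wasted alignments dominates the performance gain.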
7 Related Work
To our knowledge, this is the first paper to exploit 3D-stacked DRAM and its
processing-in-memory capabilities to implement a new seed location filtering algo-
rithm that mitigates the major bottleneck in read mapping, pre-alignment (i.e.,
seed location filtering). In this section, we briefly describe related works that aim to
1) accelerate pre-alignment algorithms, and 2) accelerate sequence alignment with
hardware support.
[4]An error tolerance of e = 0.05 is widely used in alignment during DNA read mapping [5, 25, 39,
100].
[Figure 10 plot. Y-axis: False Negative Rate (0.0 to 0.5). X-axis: Sequence Alignment Error Tolerance (e), one panel each for e = 0.00, 0.01, 0.02, 0.03, 0.04, and 0.05. Series: FastHASH filter, GRIM-Filter.]
Figure 10: False negative rates of GRIM-Filter and FastHASH filter across ten real data sets
for six different error tolerance values.
Accelerating Pre-Alignment. A very recent prior work [11] implements a seed
location filter in an FPGA, and shows significant speedup against prior filters.
However, as shown in that work, the FPGA is still limited by the memory bandwidth
bottleneck. GRIM-Filter can overcome this bottleneck on an FPGA as well.
Accelerating Sequence Alignment. Another very recent prior work [65] ex-
ploits the high memory bandwidth and the reconfigurable logic layer of 3D-stacked
memory to implement an accelerator for sequence alignment (among other basic
algorithms within the sequence analysis pipeline). Many prior works (e.g., [13–16,
26, 35, 41, 70, 81, 82, 96]) also use FPGAs to accelerate sequence alignment.
These works accelerate sequence alignment using customized FPGA implementa-
tions of different existing read mapping algorithms. For example, Arram et al. [15]
accelerate the SOAP3 tool on an FPGA engine, achieving up to 134x speedup
compared to BWA [61]. Houtgast et al. [41] present an FPGA-accelerated version
of BWA-MEM that is 3x faster compared to its software implementation. Other
works use GPUs [18, 62, 68, 69] for the same purpose of accelerating sequence
alignment. For example, Liu et al. [62] accelerate BWA and Bowtie by 7.5x and
20x, respectively. All of these accelerators focus on accelerating sequence
alignment, whereas GRIM-Filter accelerates pre-alignment (i.e., seed location
filtering). Hence, GRIM-Filter is orthogonal to these works, and can
be combined with any of them for further performance improvement.
[Figure 11 plot. Y-axis: Execution Time (×1000 seconds). X-axis: Sequence Alignment Error Tolerance (e), one panel each for e = 0.00, 0.01, 0.02, 0.03, 0.04, and 0.05. Series: mrFAST with FastHASH, GRIM-3D.]
Figure 11: Execution time of two mappers, GRIM-3D and mrFAST with FastHASH, across
ten real data sets for six different error tolerance values. Note that the scale of the Y-axis is
different for the six different graphs.
8 Future Work
We have shown that GRIM-Filter significantly reduces the execution time of read
mappers by reducing the number of unnecessary sequence alignments and by taking
advantage of processing-in-memory using 3D-stacked DRAM technology. We believe
there are many other possible applications for employing 3D-stacked DRAM tech-
nology within the genome sequence analysis pipeline (as initially explored in [65]),
and significant additional performance improvements can be obtained by combining
future techniques with GRIM-Filter. Because GRIM-Filter is essentially a seed lo-
cation filter to be employed before sequence alignment during read mapping, it can
be used in any other read mapper along with any other acceleration mechanisms in
the genome sequence analysis pipeline.
We identify three promising directions for future research: 1) exploring the
benefits of combining GRIM-Filter with various other read mappers in the field,
2) evaluating the effects of mapping to reference genomes of varying sizes, and
3) examining how GRIM-Filter can scale to process a greater number of reads
concurrently.
9 Conclusion
This paper introduces GRIM-Filter, a novel algorithm for seed location filtering,
which is a critical performance bottleneck in genome read mapping. GRIM-Filter
has three major novel aspects. First, it preprocesses the reference genome to collect
metadata on large subsequences (i.e., bins) of the genome and stores information on
whether small subsequences (i.e., tokens) are present in each bin. Second, GRIM-
Filter efficiently operates on the metadata to quickly determine whether to discard
a mapping location for a read sequence prior to an expensive sequence alignment,
thereby reducing the number of unnecessary alignments and improving performance.
Third, GRIM-Filter takes advantage of the logic layer within 3D-stacked memory,
which enables the efficient use of processing-in-memory to overcome the memory
bandwidth bottleneck in seed location filtering. We examine the trade-offs for var-
ious parameters in GRIM-Filter, and present a set of parameters that result in
significant performance improvement over the state-of-the-art seed location filter,
FastHASH. When running with a sequence alignment error tolerance of 0.05, we
show that GRIM-Filter 1) filters seed locations with 5.59x–6.41x lower false neg-
ative rates than FastHASH; and 2) improves the performance of the fastest read
mapper, mrFAST with FastHASH, by 1.81x–3.65x. GRIM-Filter is a universal seed
location filter that can be applied to any read mapper.
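The first two aspects summarized above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the GRIM-Filter implementation: Python sets stand in for the per-bin bitvectors, all function names and parameter values are hypothetical, bins are extended by the read length so that a read starting in a bin is fully covered, and the threshold uses a generic q-gram-style bound (each error can affect at most `token_len` tokens) that may differ from the threshold used in our evaluation.

```python
from typing import List, Set


def build_bin_metadata(reference: str, bin_size: int,
                       token_len: int, read_len: int) -> List[Set[str]]:
    """For each bin of the reference, record which tokens occur in it."""
    bins: List[Set[str]] = []
    for start in range(0, len(reference), bin_size):
        # Extend the bin so that any read starting inside it is fully covered.
        segment = reference[start:start + bin_size + read_len - 1]
        bins.append({segment[i:i + token_len]
                     for i in range(len(segment) - token_len + 1)})
    return bins


def token_threshold(read_len: int, token_len: int, max_errors: int) -> int:
    """Minimum number of read tokens that must be present in the bin."""
    n_tokens = read_len - token_len + 1
    return max(0, n_tokens - token_len * max_errors)


def passes_filter(read: str, location: int, bins: List[Set[str]],
                  bin_size: int, token_len: int, max_errors: int) -> bool:
    """Keep a seed location only if enough read tokens exist in its bin."""
    bin_tokens = bins[location // bin_size]
    present = sum(1 for i in range(len(read) - token_len + 1)
                  if read[i:i + token_len] in bin_tokens)
    return present >= token_threshold(len(read), token_len, max_errors)


# Toy example with hypothetical parameters.
reference = "ACGTACGTTTGACCAGTAGGCATCGA"
BIN_SIZE, TOKEN_LEN, READ_LEN = 8, 3, 6
bins = build_bin_metadata(reference, BIN_SIZE, TOKEN_LEN, READ_LEN)
read = reference[10:16]  # "GACCAG", an exact copy of the reference at location 10

print(passes_filter(read, 10, bins, BIN_SIZE, TOKEN_LEN, max_errors=0))  # True
print(passes_filter(read, 0, bins, BIN_SIZE, TOKEN_LEN, max_errors=0))   # False
```

In this toy example the true location (10) is kept and an unrelated location (0) is discarded before any alignment is attempted. The sketch does not model the third aspect: in GRIM-Filter the per-bin checks are performed in the logic layer of 3D-stacked memory, which is what relieves the memory bandwidth bottleneck.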
We believe there is very promising potential in designing DNA read mapping
algorithms for new memory technologies (like 3D-stacked DRAM) and new processing
paradigms (like processing-in-memory). We hope that the results from our
paper provide inspiration for other works to design new sequence analysis and other
bioinformatics algorithms that take advantage of new memory technologies and new
processing paradigms, such as processing-in-memory using 3D-stacked DRAM.
Acknowledgments
An earlier version of this paper appears on arXiv.org [49]. An earlier version of this work was presented as a short
talk at RECOMB-Seq [48]. We thank the anonymous reviewers for feedback. This work was supported in part by the
Semiconductor Research Corporation, the National Institutes of Health (grant HG006004 to O. Mutlu and C.
Alkan), Intel, Samsung, and VMware.
Author details
1Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA.
2Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
3NVIDIA Research, Austin, TX, USA.
4Department of Computer Engineering, Bilkent University, Bilkent, Ankara, Turkey, Bilkent, TR.
5Department of Computer Engineering, TOBB University of Economics and Technology, Sogutozu, TR.
6Department of Computer Science, ETH Zürich, Zürich, CH.
References
1. 1000 Genomes Project Consortium: An Integrated Map of Genetic Variation from 1,092 Human Genomes.
Nature 491(7422), 56–65 (2012)
2. Advanced Micro Devices, Inc.: High Bandwidth Memory — Reinventing Memory Technology.
http://www.amd.com/en-us/innovations/software-technologies/hbm
3. Advanced Micro Devices, Inc.: Radeon™ RX Vega64.
https://gaming.radeon.com/en/product/vega/radeon-rx-vega-64/
4. Advanced Micro Devices, Inc.: AMD Radeon™ R9 Series Graphics Cards with High-Bandwidth Memory.
http://www.amd.com/en-us/products/graphics/desktop/r9/
5. Ahmadi, A., Behm, A., Honnalli, N., Li, C., Weng, L., Xie, X.: Hobbes: Optimized Gram-Based Methods for
Efficient Read Alignment. Nucleic Acids Research 40(6), 41–41 (2012)
6. Ahn, J., Hong, S., Yoo, S., Mutlu, O., Choi, K.: A Scalable Processing-in-Memory Accelerator for Parallel
Graph Processing. In: International Symposium on Computer Architecture, pp. 105–117 (2015)
7. Ahn, J., Yoo, S., Mutlu, O., Choi, K.: PIM-Enabled Instructions: a Low-overhead, Locality-aware
Processing-in-Memory Architecture. In: International Symposium on Computer Architecture, pp. 336–348
(2015)
8. Akin, B., Franchetti, F., Hoe, J.C.: Data Reorganization in Memory Using 3D-Stacked DRAM. In:
International Symposium on Computer Architecture, pp. 131–143 (2015)
9. Alkan, C., Kidd, J.M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari, F., Kitzman, J.O., Baker,
C., Malig, M., Mutlu, O., et al.: Personalized Copy Number and Segmental Duplication Maps Using
Next-Generation Sequencing. Nature Genetics 41(10), 1061–1067 (2009)
10. Alser, M., Mutlu, O., Alkan, C.: MAGNET: Understanding and Improving the Accuracy of Genome
Pre-Alignment Filtering. IPSI Transactions on Internet Research (2017)
11. Alser, M., Hassan, H., Xin, H., Ergin, O., Mutlu, O., Alkan, C.: GateKeeper: A New Hardware Architecture
for Accelerating Pre-Alignment in DNA Short Read Mapping. Bioinformatics (2017).
doi:10.1093/bioinformatics/btx342
12. Altera Corporation: Hybrid Memory Cube Controller IP Core User Guide.
https://www.altera.com/en_US/pdfs/literature/ug/ug_hmcc.pdf
13. Aluru, S., Jammula, N.: A Review of Hardware Acceleration for Computational Genomics. IEEE Design &
Test 31(1), 19–30 (2014)
14. Arram, J., Tsoi, K.H., Luk, W., Jiang, P.: Hardware Acceleration of Genetic Sequence Alignment. In:
Reconfigurable Computing: Architectures, Tools and Applications, pp. 13–24 (2013)
15. Arram, J., Tsoi, K.H., Luk, W., Jiang, P.: Reconfigurable Acceleration of Short Read Mapping. In:
International Symposium on Field-Programmable Custom Computing Machines, pp. 210–217 (2013)
16. Ashley, E.A., Butte, A.J., Wheeler, M.T., Chen, R., Klein, T.E., Dewey, F.E., Dudley, J.T., Ormond, K.E.,
Pavlovic, A., Morgan, A.A., et al.: Clinical Assessment Incorporating a Personal Genome. The Lancet
375(9725), 1525–1535 (2010)
17. Babarinsa, O.O., Idreos, S.: JAFAR: Near-Data Processing for Databases. In: International Conference on
Management of Data, pp. 2069–2070 (2015)
18. Blom, J., Jakobi, T., Doppmeier, D., Jaenicke, S., Kalinowski, J., Stoye, J., Goesmann, A.: Exact and
Complete Short-Read Alignment to Microbial Genomes Using Graphics Processing Unit Programming.
Bioinformatics 27(10), 1351–1358 (2011)
19. Boroumand, A., Ghose, S., Lucia, B., Hsieh, K., Malladi, K., Zheng, H., Mutlu, O.: LazyPIM: An Efficient
Cache Coherence Mechanism for Processing-in-Memory. Computer Architecture Letters (2017)
20. Chang, K.K.: Understanding and Improving the Latency of DRAM-Based Memory Systems. PhD thesis,
Carnegie Mellon Univ. (2017)
21. Chang, K.K., Nair, P.J., Lee, D., Ghose, S., Qureshi, M.K., Mutlu, O.: Low-Cost Inter-Linked Subarrays
(LISA): Enabling Fast Inter-Subarray Data Movement in DRAM. In: International Symposium on
High-Performance Computer Architecture, pp. 568–580 (2016)
22. Chang, K.K., Kashyap, A., Hassan, H., Ghose, S., Hsieh, K., Lee, D., Li, T., Pekhimenko, G., Khan, S.,
Mutlu, O.: Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization,
Analysis, and Optimization. In: SIGMETRICS, pp. 323–336 (2016)
23. Chang, K.K., Yağlıkçı, A.G., Ghose, S., Agrawal, A., Chatterjee, N., Kashyap, A., Lee, D., O’Connor, M.,
Hassan, H., Mutlu, O.: Understanding Reduced-Voltage Operation in Modern DRAM Devices:
Experimental Characterization, Analysis, and Mechanisms. (2017)
24. Chang, K.K.-W., Lee, D., Chishti, Z., Alameldeen, A.R., Wilkerson, C., Kim, Y., Mutlu, O.: Improving DRAM
Performance by Parallelizing Refreshes with Accesses. In: International Symposium on High-Performance
Computer Architecture, pp. 356–367 (2014)
25. Cheng, H., Jiang, H., Yang, J., Xu, Y., Shang, Y.: BitMapper: An Efficient All-Mapper Based on Bit-Vector
Computing. BMC Bioinformatics 16(1), 192 (2015)
26. Chiang, J., Studniberg, M., Shaw, J., Seto, S., Truong, K.: Hardware Accelerator for Genomic Sequence
Alignment. In: Engineering in Medicine and Biology Society, pp. 5787–5789 (2006)
27. Intel Corporation: Intel Core i7-2600 Processor. https://ark.intel.com/products/52213
28. David, M., Dursi, L.J., Yao, D., Boutros, P.C., Simpson, J.T.: Nanocall: An Open Source Basecaller for
Oxford Nanopore Sequencing Data. Bioinformatics 33(1), 49–55 (2016)
29. Dlugosch, P., Brown, D., Glendenning, P., Leventhal, M., Noyes, H.: An Efficient and Scalable
Semiconductor Architecture for Parallel Automata Processing. Transactions on Parallel and Distributed
Systems 25(12), 3088–3098 (2014)
30. Farmahini-Farahani, A., Ahn, J.H., Morrow, K., Kim, N.S.: NDA: Near-DRAM Acceleration Architecture
Leveraging Commodity DRAM Devices and Standard Memory Modules. In: International Symposium on
High-Performance Computer Architecture, pp. 283–295 (2015)
31. Gao, M., Kozyrakis, C.: HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing. In:
International Symposium on High-Performance Computer Architecture, pp. 126–137 (2016)
32. Gao, M., Ayers, G., Kozyrakis, C.: Practical Near-Data Processing for In-Memory Analytics Frameworks. In:
International Conference on Parallel Architectures and Compilation Techniques, pp. 113–124 (2015)
33. Guo, Q., Alachiotis, N., Akin, B., Sadi, F., Xu, G., Low, T.M., Pileggi, L., Hoe, J.C., Franchetti, F.:
3D-Stacked Memory-Side Acceleration: Accelerator and System Design. In: Workshop on Near-Data
Processing (2014)
34. Hach, F., Sarrafi, I., Hormozdiari, F., Alkan, C., Eichler, E.E., Sahinalp, S.C.: mrsFAST-Ultra: A Compact,
SNP-Aware Mapper for High Performance Sequencing Applications. Nucleic Acids Research, 370 (2014)
35. Hasan, L., Al-Ars, Z., Vassiliadis, S.: Hardware Acceleration of Sequence Alignment Algorithms–An
Overview. In: Design & Technology of Integrated Systems in Nanoscale Era, pp. 92–97 (2007)
36. Hassan, H., Pekhimenko, G., Vijaykumar, N., Seshadri, V., Lee, D., Ergin, O., Mutlu, O.: ChargeCache:
Reducing DRAM Latency by Exploiting Row Access Locality. In: International Symposium on
High-Performance Computer Architecture, pp. 581–593 (2016)
37. Hassan, H., Vijaykumar, N., Khan, S., Ghose, S., Chang, K., Pekhimenko, G., Lee, D., Ergin, O., Mutlu, O.:
SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies.
In: International Symposium on High-Performance Computer Architecture, pp. 241–252 (2017). IEEE
38. Hassan, S.M., Yalamanchili, S., Mukhopadhyay, S.: Near Data Processing: Impact and Optimization of 3D
Memory System Architecture on the Uncore. In: International Symposium on Memory Systems, pp. 11–21
(2015)
39. Hatem, A., Bozdağ, D., Toland, A.E., Çatalyürek, Ü.V.: Benchmarking Short Sequence Mapping Tools.
BMC Bioinformatics 14(1), 184 (2013)
40. Hormozdiari, F., Hach, F., Sahinalp, S.C., Eichler, E.E., Alkan, C.: Sensitive and Fast Mapping of Di-Base
Encoded Reads. Bioinformatics 27(14), 1915–1921 (2011)
41. Houtgast, E.J., Sima, V.-M., Bertels, K., Al-Ars, Z.: An FPGA-Based Systolic Array to Accelerate the
BWA-MEM Genomic Mapping Algorithm. In: Embedded Computer Systems: Architectures, Modeling, and
Simulation, pp. 221–227 (2015)
42. Hsieh, K., Khan, S., Vijaykumar, N., Chang, K.K., Boroumand, A., Ghose, S., Mutlu, O.: Accelerating
Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation. In: International Conference
on Computer Design, pp. 25–32 (2016)
43. Hsieh, K., Ebrahimi, E., Kim, G., Chatterjee, N., O’Connor, M., Vijaykumar, N., Mutlu, O., Keckler, S.W.:
Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in
GPU Systems. In: International Symposium on Computer Architecture, pp. 204–216 (2016)
44. Hybrid Memory Cube Consortium: Hybrid Memory Cube Member Tool Resources.
http://hybridmemorycube.org/tool-resources.html
45. Ipek, E., Mutlu, O., Martínez, J.F., Caruana, R.: Self-Optimizing Memory Controllers: A Reinforcement
Learning Approach. In: International Symposium on Computer Architecture, pp. 39–50 (2008)
46. JEDEC Solid State Technology Association: High Bandwidth Memory (HBM) DRAM. Standard JESD235
(2013)
47. Kim, D.H., Athikulwongse, K., Lim, S.K.: A Study of Through-Silicon-Via Impact on the 3D Stacked IC
Layout. In: International Conference on Computer-Aided Design, pp. 674–680 (2009)
48. Kim, J.S., Senol, D., Xin, H., Lee, D., Alser, M., Hassan, H., Ergin, O., Alkan, C., Mutlu, O.: Genome Read
In-Memory (GRIM) Filter: Fast Location Filtering in DNA Read Mapping with Emerging Memory
Technologies. Presentation at RECOMB Satellite Workshop on Massively Parallel Sequencing (2016)
49. Kim, J.S., Senol, D., Xin, H., Lee, D., Ghose, S., Alser, M., Hassan, H., Ergin, O., Alkan, C., Mutlu, O.:
GRIM-Filter: Fast Seed Filtering in Read Mapping Using Emerging Memory Technologies.
arXiv:1708.04329 (2017)
50. Kim, Y.: Architectural Techniques to Enhance DRAM Scaling. PhD thesis, Carnegie Mellon Univ. (2015)
51. Kim, Y., Mutlu, O.: Memory Systems. In: Computing Handbook, Third Edition: Computer Science and
Software Engineering (2014)
52. Kim, Y., Yang, W., Mutlu, O.: Ramulator: A Fast and Extensible DRAM Simulator. Computer Architecture
Letters (2015)
53. Kim, Y., Seshadri, V., Lee, D., Liu, J., Mutlu, O.: A Case for Exploiting Subarray-Level Parallelism (SALP)
in DRAM. In: International Symposium on Computer Architecture, pp. 368–379 (2012)
54. Kim, Y., Daly, R., Kim, J., Fallin, C., Lee, J.H., Lee, D., Wilkerson, C., Lai, K., Mutlu, O.: Flipping Bits in
Memory without Accessing Them: An Experimental Study of DRAM Disturbance Errors. In: International
Symposium on Computer Architecture (2014)
55. Lee, D.: Reducing DRAM Energy at Low Cost by Exploiting Heterogeneity. PhD thesis, Carnegie Mellon Univ.
(2016)
56. Lee, D., Kim, Y., Seshadri, V., Liu, J., Subramanian, L., Mutlu, O.: Tiered-Latency DRAM: A Low Latency
and Low Cost DRAM Architecture. In: International Symposium on High-Performance Computer
Architecture (2013)
57. Lee, D., Kim, Y., Pekhimenko, G., Khan, S., Seshadri, V., Chang, K., Mutlu, O.: Adaptive-Latency DRAM:
Optimizing DRAM Timing for the Common-Case. In: International Symposium on High-Performance
Computer Architecture, pp. 489–501 (2015)
58. Lee, D., Subramanian, L., Ausavarungnirun, R., Choi, J., Mutlu, O.: Decoupled Direct Memory Access:
Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM. In: International Conference on
Parallel Architectures and Compilation Techniques, pp. 174–187 (2015). IEEE
59. Lee, D., Ghose, S., Pekhimenko, G., Khan, S., Mutlu, O.: Simultaneous Multi-Layer Access: Improving
3D-Stacked Memory Bandwidth at Low Cost. Transactions on Architecture and Code Optimization (2016)
60. Lee, D., Khan, S., Subramanian, L., Ghose, S., Ausavarungnirun, R., Pekhimenko, G., Seshadri, V., Mutlu, O.:
Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency
Reduction Mechanisms. Proceedings of the ACM on Measurement and Analysis of Computing Systems 1(1),
26 (2017)
61. Li, H., Durbin, R.: Fast and Accurate Long-Read Alignment with Burrows–Wheeler Transform. Bioinformatics
26(5), 589–595 (2010)
62. Liu, C.-M., Wong, T., Wu, E., Luo, R., Yiu, S.-M., Li, Y., Wang, B., Yu, C., Chu, X., Zhao, K., et al.: SOAP3:
Ultra-Fast GPU-Based Parallel Alignment Tool for Short Reads. Bioinformatics 28(6), 878–879 (2012)
63. Liu, J., Jaiyen, B., Veras, R., Mutlu, O.: RAIDR: Retention-Aware Intelligent DRAM Refresh. In:
International Symposium on Computer Architecture (2012)
64. Liu, J., Jaiyen, B., Kim, Y., Wilkerson, C., Mutlu, O.: An Experimental Study of Data Retention Behavior in
Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms. In: International
Symposium on Computer Architecture (2013)
65. Liu, P., Hemani, A., Paul, K., Weis, C., Jung, M., Wehn, N.: 3D-Stacked Many-Core Architecture for
Biological Sequence Analysis Problems. International Journal of Parallel Programming 45(6), 1420–1460
(2017)
66. Liu, Z., Calciu, I., Herlihy, M., Mutlu, O.: Concurrent Data Structures for Near-Memory Computing. In:
Symposium on Parallelism in Algorithms and Architectures, pp. 235–245 (2017)
67. Loh, G.H.: 3D-Stacked Memory Architectures for Multi-Core Processors. In: International Symposium on
Computer Architecture, vol. 36, pp. 453–464 (2008)
68. Luo, R., Wong, T., Zhu, J., Liu, C.-M., Zhu, X., Wu, E., Lee, L.-K., Lin, H., Zhu, W., Cheung, D.W., et al.:
SOAP3-dp: Fast, Accurate and Sensitive GPU-Based Short Read Aligner. PloS One (2013)
69. Manavski, S.A., Valle, G.: CUDA Compatible GPU Cards as Efficient Hardware Accelerators for
Smith-Waterman Sequence Alignment. BMC Bioinformatics 9(Suppl. 2), 10 (2008)
70. McMahon, P.L.: Accelerating Genomic Sequence Alignment Using High Performance Reconfigurable
Computers. PhD thesis, Univ. of California, Berkeley (2008)
71. Micron: Micron Automata Processing. http://www.micronautomata.com/hardware
72. Morad, A., Yavits, L., Ginosar, R.: GP-SIMD Processing-in-Memory. Transactions on Architecture and Code
Optimization 11(4), 53 (2015)
73. Mutlu, O.: Memory Scaling: A Systems Architecture Perspective. In: International Memory Workshop, pp.
21–25 (2013)
74. Mutlu, O.: Main Memory Scaling: Challenges and Solution Directions. In: More than Moore Technologies for
Next Generation Computer Design, pp. 127–153 (2015)
75. Mutlu, O., Moscibroda, T.: Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In:
International Symposium on Microarchitecture (2007)
76. Mutlu, O., Moscibroda, T.: Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness
of Shared DRAM Systems. In: International Symposium on Computer Architecture (2008)
77. Mutlu, O., Subramanian, L.: Research Problems and Opportunities in Memory Systems. Supercomputing
Frontiers and Innovations 1(3), 19–55 (2014)
78. Mutlu, O., Stark, J., Wilkerson, C., Patt, Y.N.: Runahead Execution: An Effective Alternative to Large
Instruction Windows. International Symposium on Microarchitecture (6), 20–25 (2003)
79. NVIDIA Corporation: Tesla P100 Data Center Accelerator.
http://www.nvidia.com/object/tesla-p100.html
80. O’Connor, M.: Highlights of the High-Bandwidth Memory (HBM) Standard. In: The Memory Forum (2014)
81. Olson, C.B., Kim, M., Clauson, C., Kogon, B., Ebeling, C., Hauck, S., Ruzzo, W.L.: Hardware Acceleration
of Short Read Mapping. In: International Symposium on Field-Programmable Custom Computing Machines,
pp. 161–168 (2012)
82. Papadopoulos, A., Kirmitzoglou, I., Promponas, V.J., Theocharides, T.: FPGA-Based Hardware Acceleration
for Local Complexity Analysis of Massive Genomic Data. Integration, the VLSI Journal 46(3), 230–239
(2013)
83. Patel, M., Kim, J.S., Mutlu, O.: The Reach Profiler (REAPER): Enabling the Mitigation of DRAM
Retention Failures via Profiling at Aggressive Conditions. In: International Symposium on Computer
Architecture, pp. 255–268 (2017)
84. Pattnaik, A., Tang, X., Jog, A., Kayiran, O., Mishra, A.K., Kandemir, M.T., Mutlu, O., Das, C.R.: Scheduling
Techniques for GPU Architectures with Processing-in-Memory Capabilities. In: International Conference on
Parallel Architectures and Compilation Techniques, pp. 31–44 (2016)
85. Rumble, S.M., Lacroute, P., Dalca, A.V., Fiume, M., Sidow, A., Brudno, M.: SHRiMP: Accurate Mapping of
Short Color-Space Reads. PLoS Computational Biology (2009)
86. SAFARI Research Group: Ramulator: A DRAM Simulator Source Code.
https://github.com/CMU-SAFARI/ramulator
87. Senol, D., Kim, J., Ghose, S., Alkan, C., Mutlu, O.: Nanopore Sequencing Technology and Tools:
Computational Analysis of the Current State, Bottlenecks and Future Directions. In: Pacific Symposium on
Biocomputing Poster Session (2017)
88. Seshadri, V., Mutlu, O.: Simple Operations in Memory to Reduce Data Movement. In: Advances in
Computers (2017)
89. Seshadri, V., Kim, Y., Fallin, C., Lee, D., Ausavarungnirun, R., Pekhimenko, G., Luo, Y., Mutlu, O., Gibbons,
P.B., Kozuch, M.A., et al.: RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and
Initialization. In: International Symposium on Microarchitecture, pp. 185–197 (2013)
90. Seshadri, V., Hsieh, K., Boroumand, A., Lee, D., Kozuch, M., Mutlu, O., Gibbons, P., Mowry, T.: Fast Bulk
Bitwise AND and OR in DRAM. Computer Architecture Letters (2015)
91. Seshadri, V., Mullins, T., Boroumand, A., Mutlu, O., Gibbons, P.B., Kozuch, M.A., Mowry, T.C.:
Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-Unit Strided
Accesses. In: International Symposium on Microarchitecture, pp. 267–280 (2015)
92. Seshadri, V., Lee, D., Mullins, T., Hassan, H., Boroumand, A., Kim, J., Kozuch, M.A., Mutlu, O., Gibbons,
P.B., Mowry, T.C.: Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM
Technology. In: International Symposium on Microarchitecture (2017)
93. Sura, Z., Jacob, A., Chen, T., Rosenburg, B., Sallenave, O., Bertolli, C., Antao, S., Brunheroto, J., Park, Y.,
O’Brien, K., et al.: Data Access Optimization in a Processing-in-Memory System. In: International
Conference on Computing Frontiers (2015)
94. Tibco: In-Memory Computing. http://www.tibco.com/products/automation/in-memory-computing
95. Tran, N.H., Chen, X.: AMAS: Optimizing the Partition and Filtration of Adaptive Seeds to Speed Up Read
Mapping. arXiv:1502.05041 (2015)
96. Waidyasooriya, H.M., Hariyama, M., Kameyama, M.: FPGA-Accelerator for DNA Sequence Alignment
Based on an Efficient Data-Dependent Memory Access Scheme. Highly-Efficient Accelerators and
Reconfigurable Technologies, 127–130 (2014)
97. Weese, D., Emde, A.-K., Rausch, T., Döring, A., Reinert, K.: RazerS—Fast Read Mapping with Sensitivity
Control. Genome Research 19(9), 1646–1654 (2009)
98. Xin, H., Lee, D., Hormozdiari, F., Yedkar, S., Mutlu, O., Alkan, C.: Accelerating Read Mapping with
FastHASH. BMC Genomics 14(Suppl. 1), 13 (2013)
99. Xin, H., Nahar, S., Zhu, R., Emmons, J., Pekhimenko, G., Kingsford, C., Alkan, C., Mutlu, O.: Optimal Seed
Solver: Optimizing Seed Selection in Read Mapping. Bioinformatics, 1632–1642 (2015)
100. Xin, H., Greth, J., Emmons, J., Pekhimenko, G., Kingsford, C., Alkan, C., Mutlu, O.: Shifted Hamming
Distance: A Fast and Accurate SIMD-Friendly Filter to Accelerate Alignment Verification in Read
Mapping. Bioinformatics, 856 (2015)
101. Yoshida, T.: SPARC64™ XIfx: Fujitsu’s Next Generation Processor for HPC. In: Hot Chips 26 Symposium,
pp. 1–31 (2014)
102. Zhang, D., Jayasena, N., Lyashevsky, A., Greathouse, J.L., Xu, L., Ignatowski, M.: TOP-PIM:
Throughput-Oriented Programmable Processing in Memory. In: International Symposium on
High-Performance Parallel and Distributed Computing, pp. 85–98 (2014)
103. Zhu, Q., Akin, B., Sumbul, H.E., Sadi, F., Hoe, J.C., Pileggi, L., Franchetti, F.: A 3D-Stacked
Logic-in-Memory Accelerator for Application-Specific Data Intensive Computing. In: 3D Systems
Integration Conference, pp. 1–7 (2013)
104. Zhu, Q., Graf, T., Sumbul, H.E., Pileggi, L., Franchetti, F.: Accelerating Sparse Matrix-Matrix
Multiplication with 3D-Stacked Logic-in-Memory Hardware. In: High Performance Extreme Computing
Conference, pp. 1–6 (2013)
