SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment Filter for CPUs, GPUs, and FPGAs
Motivation: We introduce SneakySnake, a highly parallel and highly accurate
pre-alignment filter that remarkably reduces the need for computationally
costly sequence alignment. The key idea of SneakySnake is to reduce the
approximate string matching (ASM) problem to the single net routing (SNR)
problem in VLSI chip layout. In the SNR problem, we are interested in finding
the optimal path that connects two terminals with the least routing cost on a
special grid layout that contains obstacles. The SneakySnake algorithm quickly
solves the SNR problem and uses the found optimal path to decide whether or not
performing sequence alignment is necessary. Reducing the ASM problem to SNR
also makes SneakySnake efficient to implement on CPUs, GPUs, and FPGAs.
Results: SneakySnake significantly improves the accuracy of pre-alignment
filtering by up to four orders of magnitude compared to the state-of-the-art
pre-alignment filters, Shouji, GateKeeper, and SHD. For short sequences,
SneakySnake accelerates Edlib (state-of-the-art implementation of Myers's
bit-vector algorithm) and Parasail (state-of-the-art sequence aligner with a
configurable scoring function), by up to 37.7x and 43.9x (>12x on average),
respectively, with its CPU implementation, and by up to 413x and 689x (>400x on
average), respectively, with FPGA and GPU acceleration. For long sequences, the
CPU implementation of SneakySnake accelerates Parasail and KSW2 (sequence
aligner of minimap2) by up to 979x (276.9x on average) and 91.7x (31.7x on
average), respectively. As SneakySnake does not replace sequence alignment,
users can still obtain all capabilities (e.g., configurable scoring functions)
of the aligner of their choice, unlike existing acceleration efforts that
sacrifice some aligner capabilities. Availability:
https://github.com/CMU-SAFARI/SneakySnake
Comment: To appear in Bioinformatic
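
The routing idea in the abstract above can be illustrated with a small greedy sketch. This is not the authors' implementation: the function name `sneaky_snake_filter`, the treatment of each diagonal shift as one row of the grid, and the "longest obstacle-free run, then pass one obstacle" loop are simplifying assumptions made for illustration only. The sketch counts the obstacles on a greedy path and rejects a pair once that count exceeds the edit threshold.

```python
def sneaky_snake_filter(ref: str, query: str, max_edits: int) -> bool:
    """Return True if the pair may be within max_edits and should be aligned.

    Hypothetical simplified sketch: each diagonal shift in
    [-max_edits, +max_edits] is one row of the grid, and a mismatching
    cell is an obstacle.
    """
    n = len(query)

    def is_match(shift: int, col: int) -> bool:
        j = col + shift
        return 0 <= j < len(ref) and ref[j] == query[col]

    col = 0
    obstacles = 0
    while col < n:
        # Longest obstacle-free run starting at this column, over all rows.
        best = 0
        for shift in range(-max_edits, max_edits + 1):
            run = 0
            while col + run < n and is_match(shift, col + run):
                run += 1
            best = max(best, run)
        col += best
        if col >= n:
            break
        # The path must pass through one obstacle to continue.
        obstacles += 1
        col += 1
        if obstacles > max_edits:
            return False  # too many obstacles: skip alignment
    return True
```

A pair that passes the filter still goes to the aligner of the user's choice; the filter only decides whether alignment is worth running.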
High-Performance Data Mapping for BNNs on PCM-based Integrated Photonics
State-of-the-Art (SotA) hardware implementations of Deep Neural Networks
(DNNs) incur high latencies and costs. Binary Neural Networks (BNNs) are
potential alternative solutions to realize faster implementations without
losing accuracy. In this paper, we first present a new data mapping, called
TacitMap, suited for BNNs implemented based on a Computation-In-Memory (CIM)
architecture. TacitMap maximizes the use of available parallelism, while CIM
architecture eliminates the data movement overhead. We then propose a hardware
accelerator based on optical phase change memory (oPCM) called EinsteinBarrier.
EinsteinBarrier incorporates TacitMap and adds an extra dimension for
parallelism through wavelength division multiplexing, leading to extra latency
reduction. The simulation results show that, compared to the SotA CIM baseline,
TacitMap and EinsteinBarrier significantly improve execution time by up to
~154x and ~3113x, respectively, while also maintaining the energy consumption
within 60% of that in the CIM baseline.
Comment: To appear in Design Automation and Test in Europe (DATE), 202
An In-Memory Architecture for High-Performance Long-Read Pre-Alignment Filtering
With the recent move towards sequencing of accurate long reads, finding
solutions that support efficient analysis of these reads becomes more
necessary. The long execution time required for sequence alignment of long
reads negatively affects genomic studies relying on sequence alignment.
Although pre-alignment filtering as an extra step before alignment was recently
introduced to reduce the cost of sequence alignment for short reads, these filters do not
work as efficiently for long reads. Moreover, even with efficient pre-alignment
filters, the overall end-to-end (i.e., filtering + original alignment)
execution time of alignment for long reads remains high, while the filtering
step is now a major portion of the end-to-end execution time.
Our paper makes three contributions. First, it identifies data movement of
sequences between memory units and computing units as the main source of
inefficiency for pre-alignment filters of long reads. This is because although
filters reject many of these long sequencing pairs before they get to the
alignment stage, they still require a huge cost regarding time and energy
consumption for the large data transferred between memory and processor.
Second, this paper introduces an adaptation of a short-read pre-alignment
filtering algorithm suitable for long reads. We call this LongGeneGuardian.
Finally, it presents FilterFuse as an architecture that supports
LongGeneGuardian inside the memory. FilterFuse exploits the
Computation-In-Memory computing paradigm, eliminating the cost of data movement
in LongGeneGuardian.
Our evaluations show that FilterFuse improves the execution time of filtering
by 120.47x for long reads compared to the State-of-the-Art (SoTA) filter,
SneakySnake. FilterFuse also improves the end-to-end execution time of sequence
alignment by up to 49.14x and 5207.63x compared to SneakySnake with a SoTA
aligner and to a SoTA aligner alone, respectively.
Swordfish: A Framework for Evaluating Deep Neural Network-based Basecalling using Computation-In-Memory with Non-Ideal Memristors
Basecalling, an essential step in many genome analysis studies, relies on
large Deep Neural Networks (DNNs) to achieve high accuracy. Unfortunately,
these DNNs are computationally slow and inefficient, leading to considerable
delays and resource constraints in the sequence analysis process. A
Computation-In-Memory (CIM) architecture using memristors can significantly
accelerate the performance of DNNs. However, inherent device non-idealities and
architectural limitations of such designs can greatly degrade the basecalling
accuracy, which is critical for accurate genome analysis. To facilitate the
adoption of memristor-based CIM designs for basecalling, it is important to (1)
conduct a comprehensive analysis of potential CIM architectures and (2) develop
effective strategies for mitigating the possible adverse effects of inherent
device non-idealities and architectural limitations.
This paper proposes Swordfish, a novel hardware/software co-design framework
that can effectively address the two aforementioned issues. Swordfish
incorporates seven circuit and device restrictions or non-idealities from
characterized real memristor-based chips. Swordfish leverages various
hardware/software co-design solutions to mitigate the basecalling accuracy loss
due to such non-idealities. To demonstrate the effectiveness of Swordfish, we
take Bonito, the state-of-the-art (i.e., accurate and fast), open-source
basecaller as a case study. Our experimental results using Swordfish show that
a CIM architecture can realistically accelerate Bonito for a wide range of real
datasets by an average of 25.7x, with an accuracy loss of 6.01%.
Comment: To appear in 56th IEEE/ACM International Symposium on
Microarchitecture (MICRO), 202
BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches
Generating the hash values of short subsequences, called seeds, enables
quickly identifying similarities between genomic sequences by matching seeds
with a single lookup of their hash values. However, these hash values can be
used only for finding exact-matching seeds as the conventional hashing methods
assign distinct hash values for different seeds, including highly similar
seeds. Finding only exact-matching seeds leads to either 1) increased use of
costly sequence alignment or 2) limited sensitivity.
We introduce BLEND, the first efficient and accurate mechanism that can
identify both exact-matching and highly similar seeds with a single lookup of
their hash values, called fuzzy seed matches. BLEND 1) utilizes a technique
called SimHash, which can generate the same hash value for similar sets, and 2)
provides the proper mechanisms for using seeds as sets with the SimHash
technique to find fuzzy seed matches efficiently.
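
As a rough illustration of the SimHash idea mentioned above, the following sketch treats a seed as the set of its k-mers and combines per-item hashes by bitwise majority vote. It is a generic SimHash sketch, not BLEND's implementation: the helper names, the choice of `blake2b` as the item hash, the 32-bit width, and the k-mer length are all assumptions made for illustration.

```python
import hashlib


def kmers(seq: str, k: int) -> set:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def item_hash(item: str, bits: int = 32) -> int:
    """Deterministic hash of one item, truncated to `bits` bits (assumed hash)."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(items, bits: int = 32) -> int:
    """Bitwise majority vote over the hashes of all items."""
    counters = [0] * bits
    for item in items:
        h = item_hash(item, bits)
        for b in range(bits):
            counters[b] += 1 if (h >> b) & 1 else -1
    result = 0
    for b in range(bits):
        if counters[b] > 0:
            result |= 1 << b
    return result


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")
```

Sets that share most of their items tend to agree on most bit votes, so their SimHash values are close in Hamming distance and can coincide exactly; identical hash values are likely for similar sets but not guaranteed.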
We show the benefits of BLEND when used in read overlapping and read mapping.
For read overlapping, BLEND is faster by 2.6x-63.5x (on average 19.5x), has a
lower memory footprint by 0.9x-9.7x (on average 3.6x), and finds higher quality
overlaps leading to more accurate de novo assemblies than the state-of-the-art tool,
minimap2. For read mapping, BLEND is faster by 0.7x-3.7x (on average 1.7x) than
minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis
Profile hidden Markov models (pHMMs) are widely employed in various
bioinformatics applications to identify similarities between biological
sequences, such as DNA or protein sequences. In pHMMs, sequences are
represented as graph structures whose states and edges carry probabilities.
These probabilities are subsequently used to
compute the similarity score between a sequence and a pHMM graph. The
Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these
probabilities to optimize and compute similarity scores. However, the
Baum-Welch algorithm is computationally intensive, and existing solutions offer
either software-only or hardware-only approaches with fixed pHMM designs. We
identify an urgent need for a flexible, high-performance, and energy-efficient
HW/SW co-design to address the major inefficiencies in the Baum-Welch algorithm
for pHMMs.
We introduce ApHMM, the first flexible acceleration framework designed to
significantly reduce both computational and energy overheads associated with
the Baum-Welch algorithm for pHMMs. ApHMM tackles the major inefficiencies in
the Baum-Welch algorithm by 1) designing flexible hardware to accommodate
various pHMM designs, 2) exploiting predictable data dependency patterns
through on-chip memory with memoization techniques, 3) rapidly filtering out
negligible computations using a hardware-based filter, and 4) minimizing
redundant computations.
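
For orientation, the forward pass that the Baum-Welch algorithm builds on can be sketched for a generic HMM as below. This is a textbook forward recursion, not ApHMM's design or a pHMM-specific implementation: the function name, the list-and-dictionary parameter layout, and the assumption of a non-empty observation sequence are illustrative choices only.

```python
def forward_likelihood(obs, init, trans, emit):
    """Probability of an observation sequence under a generic HMM.

    obs:   non-empty sequence of symbols (assumption of this sketch)
    init:  init[s] is the probability of starting in state s
    trans: trans[p][s] is the probability of moving from state p to state s
    emit:  emit[s][symbol] is the probability that state s emits symbol
    """
    n_states = len(init)
    # alpha[s]: probability of the prefix seen so far, ending in state s.
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)
```

Each step multiplies and sums over all state pairs, which is where most of the work in such a computation goes and why skipping negligible terms can pay off.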
ApHMM achieves substantial speedups of 15.55x - 260.03x, 1.83x - 5.34x, and
27.97x when compared to CPU, GPU, and FPGA implementations of the Baum-Welch
algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations
in three key bioinformatics applications: 1) error correction, 2) protein
family search, and 3) multiple sequence alignment, by 1.29x - 59.94x, 1.03x -
1.75x, and 1.03x - 1.95x, respectively, while improving their energy efficiency
by 64.24x - 115.46x, 1.75x, and 1.96x, respectively.
Comment: Accepted to ACM TAC