RawHash2: Accurate and Fast Mapping of Raw Nanopore Signals using a Hash-based Seeding Mechanism
Summary: Raw nanopore signals can be analyzed while they are being generated,
a process known as real-time analysis. Real-time analysis of raw signals is
essential to utilize the unique features that nanopore sequencing provides,
enabling the early stopping of the sequencing of a read or the entire
sequencing run based on the analysis. The state-of-the-art mechanism, RawHash,
offers the first hash-based efficient and accurate similarity identification
between raw signals and a reference genome by quickly matching their hash
values. In this work, we introduce RawHash2, which provides major improvements
over RawHash, including a more sensitive chaining implementation, weighted
mapping decisions, frequency filters to reduce ambiguous seed hits, minimizers
for hash-based sketching, and support for the R10.4 flow cell version and
various data formats such as POD5. Compared to RawHash, RawHash2 provides
better F1 accuracy (on average by 3.44% and up to 10.32%) and better throughput
(on average by 2.3x and up to 5.4x).
Availability and Implementation: RawHash2 is available at
https://github.com/CMU-SAFARI/RawHash. We also provide the scripts to fully
reproduce our results on our GitHub page.
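Illustrative note (not taken from the RawHash2 source code): the abstract lists minimizers for hash-based sketching among the improvements. A minimal sketch of the general minimizer idea follows, assuming the seeds have already been turned into integer hash values. The function name `minimizers` and the window parameter `w` are hypothetical, and the actual RawHash2 implementation may differ.

```python
def minimizers(hashes, w):
    """Keep the smallest hash in every window of w consecutive seed hashes.

    Returns sorted, de-duplicated (position, hash) pairs. This is a generic
    minimizer sketch, not the RawHash2 implementation.
    """
    if w <= 0:
        raise ValueError("window size must be positive")
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        # min() returns the first minimum, so ties go to the leftmost position.
        offset = min(range(w), key=lambda i: window[i])
        selected.add((start + offset, window[offset]))
    return sorted(selected)


if __name__ == "__main__":
    print(minimizers([5, 3, 8, 1, 9, 2], w=3))
```

Because adjacent windows often share the same minimum, far fewer seeds are stored than with one seed per position, which is the usual motivation for minimizer sketching.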
FastRemap: A Tool for Quickly Remapping Reads between Genome Assemblies
A genome read data set can be quickly and efficiently remapped from one
reference to another similar reference (e.g., between two reference versions or
two similar species) using a variety of tools, e.g., the commonly-used CrossMap
tool. With the explosion of available genomic data sets and references,
high-performance remapping tools will be even more important for keeping up
with the computational demands of genome assembly and analysis.
We provide FastRemap, a fast and efficient tool for remapping reads between
genome assemblies. FastRemap provides up to a 7.82x speedup
(6.47x, on average) and uses as little as 61.7% (80.7%, on average) of the
peak memory consumption compared to the state-of-the-art remapping tool,
CrossMap.
FastRemap is written in C++. The source code and user manual are freely
available at: github.com/CMU-SAFARI/FastRemap. Docker image available at:
https://hub.docker.com/r/alkanlab/fast. Also available in Bioconda at:
https://anaconda.org/bioconda/fastremap-bio.
Comment: FastRemap is open source and all scripts needed to replicate the
results in this paper can be found at https://github.com/CMU-SAFARI/FastRema
TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering
Basecalling is an essential step in nanopore sequencing analysis where the
raw signals of nanopore sequencers are converted into nucleotide sequences,
i.e., reads. State-of-the-art basecallers employ complex deep learning models
to achieve high basecalling accuracy. This makes basecalling
computationally inefficient and memory-hungry, bottlenecking the entire genome
analysis pipeline. However, for many applications, the majority of reads do not
match the reference genome of interest (i.e., target reference) and thus are
discarded in later steps in the genomics pipeline, wasting the basecalling
computation. To overcome this issue, we propose TargetCall, the first fast and
widely-applicable pre-basecalling filter to eliminate the wasted computation in
basecalling. TargetCall's key idea is to discard reads that will not match the
target reference (i.e., off-target reads) prior to basecalling. TargetCall
consists of two main components: (1) LightCall, a lightweight neural network
basecaller that produces noisy reads; and (2) Similarity Check, which labels
each of these noisy reads as on-target or off-target by matching them to the
target reference. TargetCall filters out all off-target reads before
basecalling, and the highly-accurate but slow basecalling is performed only on
the raw signals whose noisy reads are labeled as on-target. Our thorough
experimental evaluations using both real and simulated data show that
TargetCall 1) improves the end-to-end basecalling performance of the
state-of-the-art basecaller by 3.31x while maintaining high (98.88%)
sensitivity in keeping on-target reads, 2) maintains high accuracy in
downstream analysis, 3) precisely filters out up to 94.71% of off-target reads,
and 4) achieves better performance, sensitivity, and generality compared to
prior works. We freely open-source TargetCall at
https://github.com/CMU-SAFARI/TargetCall
RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes
Nanopore sequencers generate electrical raw signals in real-time while
sequencing long genomic strands. These raw signals can be analyzed as they are
generated, providing an opportunity for real-time genome analysis. An important
feature of nanopore sequencing, Read Until, can eject strands from sequencers
without fully sequencing them, which provides opportunities to computationally
reduce the sequencing time and cost. However, existing works utilizing Read
Until either 1) require powerful computational resources that may not be
available for portable sequencers or 2) lack scalability for large genomes,
rendering them inaccurate or ineffective.
We propose RawHash, the first mechanism that can accurately and efficiently
perform real-time analysis of nanopore raw signals for large genomes using a
hash-based similarity search. To enable this, RawHash ensures the signals
corresponding to the same DNA content lead to the same hash value, regardless
of the slight variations in these signals. RawHash achieves an accurate
hash-based similarity search via an effective quantization of the raw signals
such that signals corresponding to the same DNA content have the same quantized
value and, subsequently, the same hash value.
We evaluate RawHash on three applications: 1) read mapping, 2) relative
abundance estimation, and 3) contamination analysis. Our evaluations show that
RawHash is the only tool that can provide high accuracy and high throughput for
analyzing large genomes in real-time. When compared to the state-of-the-art
techniques, UNCALLED and Sigmap, RawHash provides 1) 25.8x and 3.4x better
average throughput and 2) an average speedup of 32.1x and 2.1x in the mapping
time, respectively.
Source code is available at https://github.com/CMU-SAFARI/RawHash
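Illustrative note (not taken from the RawHash source code): the abstract describes quantizing raw signals so that slightly different signals from the same DNA content share one hash value. The sketch below shows that idea under loud assumptions: signal events are already normalized floats, the clipping range `[-3.0, 3.0]`, the 4-bit bucket count, and the names `quantize` and `seed_key` are all hypothetical, and the packed integer stands in for the hash value.

```python
def quantize(value, lo=-3.0, hi=3.0, bits=4):
    """Map a normalized signal value to one of 2**bits integer buckets."""
    levels = 1 << bits
    clipped = min(max(value, lo), hi)
    bucket = int((clipped - lo) / (hi - lo) * levels)
    # The upper edge (value == hi) would land on `levels`; fold it into the top bucket.
    return min(bucket, levels - 1)


def seed_key(events, bits=4):
    """Pack the quantized values of consecutive events into one integer key."""
    key = 0
    for value in events:
        key = (key << bits) | quantize(value, bits=bits)
    return key


if __name__ == "__main__":
    a = [0.10, -1.20, 2.05]   # events from one read
    b = [0.15, -1.25, 2.10]   # same content with small signal variations
    print(seed_key(a), seed_key(b))
```

Small variations that stay inside one bucket yield identical keys, so a single table lookup can match them. Values close to a bucket boundary can still fall into different buckets, so this simple scheme does not guarantee a match.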
A Framework for Designing Efficient Deep Learning-Based Genomic Basecallers
Nanopore sequencing generates noisy electrical signals that need to be
converted into a standard string of DNA nucleotide bases using a computational
step called basecalling. The accuracy and speed of basecalling have critical
implications for all later steps in genome analysis. Many researchers adopt
complex deep learning-based models to perform basecalling without considering
the compute demands of such models, which leads to slow, inefficient, and
memory-hungry basecallers. Therefore, there is a need to reduce the computation
and memory cost of basecalling while maintaining accuracy. Our goal is to
develop a comprehensive framework for creating deep learning-based basecallers
that provide high efficiency and performance. We introduce RUBICON, a framework
to develop hardware-optimized basecallers. RUBICON consists of two novel
machine-learning techniques that are specifically designed for basecalling.
First, we introduce the first quantization-aware basecalling neural
architecture search (QABAS) framework to specialize the basecalling neural
network architecture for a given hardware acceleration platform while jointly
exploring and finding the best bit-width precision for each neural network
layer. Second, we develop SkipClip, the first technique to remove the skip
connections present in modern basecallers to greatly reduce resource and
storage requirements without any loss in basecalling accuracy. We demonstrate
the benefits of RUBICON by developing RUBICALL, the first hardware-optimized
basecaller that performs fast and accurate basecalling. Compared to the fastest
state-of-the-art basecaller, RUBICALL provides a 3.96x speedup with 2.97%
higher accuracy. We show that RUBICON helps researchers develop
hardware-optimized basecallers that are superior to expert-designed models.
Swordfish: A Framework for Evaluating Deep Neural Network-based Basecalling using Computation-In-Memory with Non-Ideal Memristors
Basecalling, an essential step in many genome analysis studies, relies on
large Deep Neural Networks (DNNs) to achieve high accuracy. Unfortunately,
these DNNs are computationally slow and inefficient, leading to considerable
delays and resource constraints in the sequence analysis process. A
Computation-In-Memory (CIM) architecture using memristors can significantly
accelerate the performance of DNNs. However, inherent device non-idealities and
architectural limitations of such designs can greatly degrade the basecalling
accuracy, which is critical for accurate genome analysis. To facilitate the
adoption of memristor-based CIM designs for basecalling, it is important to (1)
conduct a comprehensive analysis of potential CIM architectures and (2) develop
effective strategies for mitigating the possible adverse effects of inherent
device non-idealities and architectural limitations.
This paper proposes Swordfish, a novel hardware/software co-design framework
that can effectively address the two aforementioned issues. Swordfish
incorporates seven circuit and device restrictions or non-idealities from
characterized real memristor-based chips. Swordfish leverages various
hardware/software co-design solutions to mitigate the basecalling accuracy loss
due to such non-idealities. To demonstrate the effectiveness of Swordfish, we
take Bonito, the state-of-the-art (i.e., accurate and fast), open-source
basecaller, as a case study. Our experimental results using Swordfish show that
a CIM architecture can realistically accelerate Bonito for a wide range of real
datasets by an average of 25.7x, with an accuracy loss of 6.01%.
Comment: To appear in 56th IEEE/ACM International Symposium on
Microarchitecture (MICRO), 202
BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches
Generating the hash values of short subsequences, called seeds, enables
quickly identifying similarities between genomic sequences by matching seeds
with a single lookup of their hash values. However, these hash values can be
used only for finding exact-matching seeds as the conventional hashing methods
assign distinct hash values for different seeds, including highly similar
seeds. Finding only exact-matching seeds either 1) increases the use of
costly sequence alignment or 2) limits sensitivity.
We introduce BLEND, the first efficient and accurate mechanism that can
identify both exact-matching and highly similar seeds with a single lookup of
their hash values, called fuzzy seed matches. BLEND 1) utilizes a technique
called SimHash, which can generate the same hash value for similar sets, and 2)
provides the proper mechanisms for using seeds as sets with the SimHash
technique to find fuzzy seed matches efficiently.
We show the benefits of BLEND when used in read overlapping and read mapping.
For read overlapping, BLEND is faster by 2.6x-63.5x (on average 19.5x), has a
lower memory footprint by 0.9x-9.7x (on average 3.6x), and finds higher quality
overlaps leading to more accurate de novo assemblies than the state-of-the-art tool,
minimap2. For read mapping, BLEND is faster by 0.7x-3.7x (on average 1.7x) than
minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND
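Illustrative note (not taken from the BLEND source code): the abstract relies on SimHash, which can give similar sets the same hash value. The sketch below shows the generic SimHash construction over the k-mers of a seed. The helper names `item_hash`, `kmers`, and `simhash`, the 32-bit width, and the use of BLAKE2b as the per-item hash are assumptions made for illustration only.

```python
import hashlib


def item_hash(item, bits=32):
    """Hash one set element (here a k-mer string) to a bits-wide integer."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def kmers(seq, k):
    """All overlapping substrings of length k; these form the seed's set."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def simhash(items, bits=32):
    """Generic SimHash: a per-bit majority vote over the item hashes."""
    counters = [0] * bits
    for item in items:
        h = item_hash(item, bits)
        for b in range(bits):
            counters[b] += 1 if (h >> b) & 1 else -1
    result = 0
    for b in range(bits):
        if counters[b] > 0:
            result |= 1 << b
    return result


if __name__ == "__main__":
    print(hex(simhash(kmers("ACGTACGTTAGC", 4))))
    print(hex(simhash(kmers("ACGTACGTTAGG", 4))))
```

Each output bit is a majority vote, so changing a few set elements flips only the bits whose vote was close. Seeds that share most of their k-mers therefore tend to get equal or nearly equal hash values, although this sketch does not guarantee equality for any particular pair.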
GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping
Nanopore sequencing is a widely-used high-throughput genome sequencing
technology that can sequence long fragments of a genome into raw electrical
signals at low cost. Nanopore sequencing requires two computationally-costly
processing steps for accurate downstream genome analysis. The first step,
basecalling, translates the raw electrical signals into nucleotide bases (i.e.,
A, C, G, T). The second step, read mapping, finds the correct location of a
read in a reference genome. In existing genome analysis pipelines, basecalling
and read mapping are executed separately. We observe in this work that such
separate execution of the two most time-consuming steps inherently leads to (1)
significant data movement and (2) redundant computations on the data, slowing
down the genome analysis pipeline. This paper proposes GenPIP, an in-memory
genome analysis accelerator that tightly integrates basecalling and read
mapping. GenPIP improves the performance of the genome analysis pipeline with
two key mechanisms: (1) in-memory fine-grained collaborative execution of the
major genome analysis steps in parallel; (2) a new technique for
early rejection of low-quality and unmapped reads to promptly stop the execution
of genome analysis for such reads, reducing inefficient computation. Our
experiments show that, for the execution of the genome analysis pipeline,
GenPIP provides 41.6X (8.4X) speedup and 32.8X (20.8X) energy savings with
negligible accuracy loss compared to the state-of-the-art software genome
analysis tools executed on a state-of-the-art CPU (GPU). Compared to a design
that combines state-of-the-art in-memory basecalling and read mapping
accelerators, GenPIP provides 1.39X speedup and 1.37X energy savings.
Comment: 17 pages, 13 figure
Utopia: Fast and Efficient Address Translation via Hybrid Restrictive & Flexible Virtual-to-Physical Address Mappings
Conventional virtual memory (VM) frameworks enable a virtual address to
flexibly map to any physical address. This flexibility necessitates large data
structures to store virtual-to-physical mappings, which leads to high address
translation latency and large translation-induced interference in the memory
hierarchy. On the other hand, restricting the address mapping so that a virtual
address can only map to a specific set of physical addresses can significantly
reduce address translation overheads by using compact and efficient translation
structures. However, restricting the address mapping flexibility across the
entire main memory severely limits data sharing across different processes and
increases data accesses to the swap space of the storage device, even in the
presence of free memory. We propose Utopia, a new hybrid virtual-to-physical
address mapping scheme that allows both flexible and restrictive hash-based
address mapping schemes to harmoniously co-exist in the system. The key idea of
Utopia is to manage physical memory using two types of physical memory
segments: restrictive and flexible segments. A restrictive segment uses a
restrictive, hash-based address mapping scheme that maps virtual addresses to
only a specific set of physical addresses and enables faster address
translation using compact translation structures. A flexible segment employs
the conventional fully-flexible address mapping scheme. By mapping data to a
restrictive segment, Utopia enables faster address translation with lower
translation-induced interference. Utopia improves performance by 24% in a
single-core system over the baseline system, whereas the best prior
state-of-the-art contiguity-aware translation scheme improves performance by
13%.
Comment: To appear in 56th IEEE/ACM International Symposium on
Microarchitecture (MICRO), 202
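Illustrative note (not taken from the Utopia paper's design details): the abstract contrasts a restrictive, hash-based segment, where a virtual page can live only in a small set of physical frames, with a conventional flexible segment. The toy model below sketches that split under strong assumptions: the class name `HybridMapper`, the modulo "hash", the set-associative layout, and the dictionary standing in for a page table are all hypothetical simplifications.

```python
class HybridMapper:
    """Toy model: restrictive hash-indexed segment plus a flexible fallback."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # Restrictive segment: each set holds up to `ways` virtual page numbers.
        self.sets = [[None] * ways for _ in range(num_sets)]
        # Flexible segment: any virtual page can map to any remaining frame.
        self.flexible = {}
        self.next_flexible_frame = num_sets * ways

    def _set_index(self, vpn):
        # Placeholder hash (assumption): low-order bits of the page number.
        return vpn % self.num_sets

    def translate(self, vpn):
        """Check the small candidate set first, then the flexible mapping."""
        s = self._set_index(vpn)
        for way, tag in enumerate(self.sets[s]):
            if tag == vpn:
                return s * self.ways + way
        return self.flexible.get(vpn)

    def map(self, vpn):
        """Place a page in its restrictive set if a way is free, else fall back."""
        frame = self.translate(vpn)
        if frame is not None:
            return frame
        s = self._set_index(vpn)
        for way in range(self.ways):
            if self.sets[s][way] is None:
                self.sets[s][way] = vpn
                return s * self.ways + way
        frame = self.next_flexible_frame
        self.next_flexible_frame += 1
        self.flexible[vpn] = frame
        return frame


if __name__ == "__main__":
    m = HybridMapper(num_sets=4, ways=2)
    print([m.map(v) for v in (0, 4, 8, 1)])
```

In this model, translating a page in the restrictive segment only inspects `ways` candidate frames, whereas pages that do not fit fall back to the general lookup structure.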