10 research outputs found
An In-Memory Architecture for High-Performance Long-Read Pre-Alignment Filtering
With the recent move towards sequencing of accurate long reads, finding
solutions that support efficient analysis of these reads has become
increasingly important. The long execution time required for sequence alignment
of long reads negatively affects genomic studies that rely on sequence
alignment.
Although pre-alignment filtering was recently introduced as an extra step
before alignment to reduce the cost of sequence alignment for short reads,
these filters do not work as efficiently for long reads. Moreover, even with
efficient pre-alignment filters, the overall end-to-end (i.e., filtering +
original alignment) execution time of alignment for long reads remains high,
and the filtering step now constitutes a major portion of that end-to-end
execution time.
Our paper makes three contributions. First, it identifies data movement of
sequences between memory units and computing units as the main source of
inefficiency for pre-alignment filters of long reads. This is because,
although filters reject many of these long sequence pairs before they reach
the alignment stage, transferring the large volume of sequence data between
memory and processor still incurs a substantial time and energy cost.
Second, this paper introduces LongGeneGuardian, an adaptation of a short-read
pre-alignment filtering algorithm that is suitable for long reads. Finally, it
presents FilterFuse, an architecture that supports LongGeneGuardian inside the
memory. FilterFuse exploits the Computation-In-Memory paradigm, eliminating the
cost of data movement in LongGeneGuardian.
Our evaluations show that FilterFuse improves the execution time of filtering
for long reads by 120.47x compared to the state-of-the-art (SoTA) filter,
SneakySnake. FilterFuse also improves the end-to-end execution time of sequence
alignment by up to 49.14x and 5207.63x compared to SneakySnake combined with
the SoTA aligner and to the SoTA aligner alone, respectively.
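Neither the abstract nor this listing spells out LongGeneGuardian's algorithm, so the sketch below only illustrates, under assumed heuristics, what a pre-alignment filter does in general: cheaply lower-bound the edit distance of a read/reference pair and reject pairs that provably exceed the alignment threshold, so the expensive aligner (and, in FilterFuse, the associated data movement) is skipped for them. The length-difference and base-composition bounds and the threshold parameter are illustrative assumptions, not the paper's method.

```python
from collections import Counter

def passes_prealignment_filter(read: str, ref: str, edit_threshold: int) -> bool:
    """Cheap pre-alignment check: return False only when the pair provably
    cannot be within `edit_threshold` edits, so rejected pairs can safely
    skip full alignment. Illustrative heuristic, not LongGeneGuardian."""
    # Any alignment needs at least |len(read) - len(ref)| insertions/deletions.
    if abs(len(read) - len(ref)) > edit_threshold:
        return False
    # Each edit changes the base composition by at most two symbols, so half
    # of the symmetric composition difference lower-bounds the edit distance.
    read_counts, ref_counts = Counter(read.upper()), Counter(ref.upper())
    composition_diff = (sum((read_counts - ref_counts).values())
                        + sum((ref_counts - read_counts).values()))
    if composition_diff // 2 > edit_threshold:
        return False
    return True  # the pair survives filtering and proceeds to full alignment
```

A production filter such as SneakySnake uses much tighter, hardware-friendly bounds; the point here is only the reject-early control flow that FilterFuse executes inside memory.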
MetaTrinity: Enabling Fast Metagenomic Classification via Seed Counting and Edit Distance Approximation
Metagenomics, the study of genome sequences of diverse organisms cohabiting
in a shared environment, has experienced significant advancements across
various medical and biological fields. Metagenomic analysis is crucial, for
instance, in clinical applications such as infectious disease screening and the
diagnosis and early detection of diseases such as cancer. A key task in
metagenomics is to determine the species present in a sample and their relative
abundances. Currently, the field is dominated by either alignment-based tools,
which offer high accuracy but are computationally expensive, or alignment-free
tools, which are fast but lack the needed accuracy for many applications. In
response to this dichotomy, we introduce MetaTrinity, a tool based on
heuristics, to achieve a fundamental improvement in accuracy-runtime tradeoff
over existing methods. We benchmark MetaTrinity against two leading metagenomic
classifiers, each representing different ends of the performance-accuracy
spectrum. On one end, Kraken2, a tool optimized for performance, offers a
rapid runtime at modest accuracy. The other end of the spectrum is occupied by
Metalign, a tool optimized for accuracy. Our evaluations show that MetaTrinity
achieves accuracy comparable to Metalign while running 4x faster, which
directly equates to a fourfold improvement in the runtime-accuracy tradeoff.
Compared to Kraken2, MetaTrinity requires a 5x longer runtime yet delivers a
17x improvement in accuracy, a 3.4x enhancement in the accuracy-runtime
tradeoff. This dual
comparison positions MetaTrinity as a broadly applicable solution for
metagenomic classification, combining advantages of both ends of the spectrum:
speed and accuracy. MetaTrinity is publicly available at
https://github.com/CMU-SAFARI/MetaTrinity
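For clarity, the tradeoff factors quoted above follow directly from the
reported numbers:

$$\text{vs. Kraken2: } \frac{17\times \text{ (accuracy gain)}}{5\times \text{ (runtime cost)}} \approx 3.4\times, \qquad \text{vs. Metalign: } \frac{4\times \text{ (speedup)}}{1\times \text{ (unchanged accuracy)}} = 4\times$$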
RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes
Nanopore sequencers generate electrical raw signals in real-time while
sequencing long genomic strands. These raw signals can be analyzed as they are
generated, providing an opportunity for real-time genome analysis. An important
feature of nanopore sequencing, Read Until, can eject strands from sequencers
without fully sequencing them, which provides opportunities to computationally
reduce the sequencing time and cost. However, existing works utilizing Read
Until either 1) require powerful computational resources that may not be
available for portable sequencers or 2) lack scalability for large genomes,
rendering them inaccurate or ineffective.
We propose RawHash, the first mechanism that can accurately and efficiently
perform real-time analysis of nanopore raw signals for large genomes using a
hash-based similarity search. To enable this, RawHash ensures that signals
corresponding to the same DNA content lead to the same hash value, regardless
of slight variations in these signals. RawHash achieves an accurate
hash-based similarity search via an effective quantization of the raw signals
such that signals corresponding to the same DNA content have the same quantized
value and, subsequently, the same hash value.
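As a rough illustration of the quantization idea described above (not RawHash's actual event detection, parameters, or hash function, which are in the linked repository), normalized signal values can be mapped to coarse buckets so that small measurement noise does not change the bucket, and a window of buckets can then be packed into a single hashable key:

```python
import hashlib

def quantize(signal_values, bucket_width=0.35):
    """Map normalized current measurements to coarse buckets so that slight
    signal variations still land in the same bucket (bucket_width is an
    illustrative value, not a RawHash parameter)."""
    return tuple(int(round(v / bucket_width)) for v in signal_values)

def hash_signal_window(signal_values, bucket_width=0.35):
    """Hash a window of quantized values; windows produced by the same DNA
    content should collide, enabling hash-based similarity search."""
    key = ",".join(str(q) for q in quantize(signal_values, bucket_width))
    return hashlib.sha1(key.encode()).hexdigest()[:16]

# Two noisy observations of the same underlying signal hash identically.
window_a = [0.71, -0.33, 1.02, 0.18]
window_b = [0.69, -0.35, 1.05, 0.20]
assert hash_signal_window(window_a) == hash_signal_window(window_b)
```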
We evaluate RawHash on three applications: 1) read mapping, 2) relative
abundance estimation, and 3) contamination analysis. Our evaluations show that
RawHash is the only tool that can provide high accuracy and high throughput for
analyzing large genomes in real-time. When compared to the state-of-the-art
techniques, UNCALLED and Sigmap, RawHash provides 1) 25.8x and 3.4x better
average throughput and 2) an average speedup of 32.1x and 2.1x in the mapping
time, respectively.
Source code is available at https://github.com/CMU-SAFARI/RawHash
SequenceLab: A Comprehensive Benchmark of Computational Methods for Comparing Genomic Sequences
Computational complexity is a key limitation of genomic analyses. Thus, over
the last 30 years, researchers have proposed numerous fast heuristic methods
that provide computational relief. Comparing genomic sequences is one of the
most fundamental computational steps in most genomic analyses. Due to its high
computational complexity, optimized exact and heuristic algorithms are still
being developed. We find that these methods are highly sensitive to the
underlying data, its quality, and various hyperparameters. Despite their wide
use, no in-depth analysis of this sensitivity has been performed, which risks
falsely discarding genetic sequences from further analysis and unnecessarily
inflating computational costs. We provide the first analysis and benchmark of
this
heterogeneity. We deliver an actionable overview of the 11 most widely used
state-of-the-art methods for comparing genomic sequences. We also inform
readers about their advantages and downsides using thorough experimental
evaluation and different real datasets from all major manufacturers (i.e.,
Illumina, ONT, and PacBio). SequenceLab is publicly available at
https://github.com/CMU-SAFARI/SequenceLab
Swordfish: A Framework for Evaluating Deep Neural Network-based Basecalling using Computation-In-Memory with Non-Ideal Memristors
Basecalling, an essential step in many genome analysis studies, relies on
large Deep Neural Networks (DNNs) to achieve high accuracy. Unfortunately,
these DNNs are computationally slow and inefficient, leading to considerable
delays and resource constraints in the sequence analysis process. A
Computation-In-Memory (CIM) architecture using memristors can significantly
accelerate the performance of DNNs. However, inherent device non-idealities and
architectural limitations of such designs can greatly degrade the basecalling
accuracy, which is critical for accurate genome analysis. To facilitate the
adoption of memristor-based CIM designs for basecalling, it is important to (1)
conduct a comprehensive analysis of potential CIM architectures and (2) develop
effective strategies for mitigating the possible adverse effects of inherent
device non-idealities and architectural limitations.
This paper proposes Swordfish, a novel hardware/software co-design framework
that can effectively address the two aforementioned issues. Swordfish
incorporates seven circuit and device restrictions or non-idealities from
characterized real memristor-based chips. Swordfish leverages various
hardware/software co-design solutions to mitigate the basecalling accuracy loss
due to such non-idealities. To demonstrate the effectiveness of Swordfish, we
take Bonito, the state-of-the-art (i.e., accurate and fast), open-source
basecaller as a case study. Our experimental results using Swordfish show that
a CIM architecture can realistically accelerate Bonito for a wide range of real
datasets by an average of 25.7x, with an accuracy loss of 6.01%.
Comment: To appear in the 56th IEEE/ACM International Symposium on
Microarchitecture (MICRO), 2023.
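The abstract does not enumerate the seven modeled restrictions and non-idealities, so the sketch below is only a generic illustration of how device non-idealities are commonly injected when evaluating memristor-based CIM: perturb a layer's weights (programming noise), clip them to the representable conductance range before the matrix-vector product, and measure the drift from the ideal result. The noise model and magnitudes are assumptions, not Swordfish's characterized chip data.

```python
import numpy as np

def nonideal_matvec(weights, x, rel_std=0.05, g_min=0.0, g_max=1.0, rng=None):
    """Matrix-vector product on a simulated memristor crossbar with two
    assumed, illustrative non-idealities: multiplicative programming noise
    (rel_std) and clipping to the representable conductance range."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = weights * (1.0 + rel_std * rng.standard_normal(weights.shape))
    noisy = np.clip(noisy, g_min, g_max)
    return noisy @ x

rng = np.random.default_rng(0)
w = rng.uniform(0.0, 1.0, size=(64, 128))   # stand-in for one DNN layer
x = rng.uniform(0.0, 1.0, size=128)
ideal, actual = w @ x, nonideal_matvec(w, x, rng=rng)
drift = np.linalg.norm(actual - ideal) / np.linalg.norm(ideal)
print(f"output drift due to modeled non-idealities: {drift:.3%}")
```

In a basecaller such as Bonito, this kind of per-layer drift is what accumulates into the end-to-end accuracy loss that Swordfish's co-design aims to mitigate.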
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis
Profile hidden Markov models (pHMMs) are widely employed in various
bioinformatics applications to identify similarities between biological
sequences, such as DNA or protein sequences. In pHMMs, sequences are
represented as graph structures whose states and transitions are assigned
probabilities. These probabilities are subsequently used to
compute the similarity score between a sequence and a pHMM graph. The
Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these
probabilities to optimize and compute similarity scores. However, the
Baum-Welch algorithm is computationally intensive, and existing solutions offer
either software-only or hardware-only approaches with fixed pHMM designs. We
identify an urgent need for a flexible, high-performance, and energy-efficient
HW/SW co-design to address the major inefficiencies in the Baum-Welch algorithm
for pHMMs.
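To make the underlying computation concrete, the sketch below shows the forward recurrence for a generic HMM, the dynamic-programming core that Baum-Welch evaluates repeatedly (together with a symmetric backward pass) when re-estimating probabilities. It is a textbook illustration only; pHMMs add match/insert/delete state structure, and ApHMM's contribution is the hardware/software co-design around these recurrences, not this code.

```python
import numpy as np

def forward(observations, init, trans, emit):
    """Forward algorithm: probability of an observation sequence under an HMM.

    init[i]     : probability of starting in state i
    trans[i, j] : probability of moving from state i to state j
    emit[i, o]  : probability of emitting symbol o from state i
    """
    alpha = init * emit[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ trans) * emit[:, obs]
    return alpha.sum()

# Toy 2-state HMM over the DNA alphabet {A, C, G, T} mapped to indices 0..3.
init = np.array([0.6, 0.4])
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
emit = np.array([[0.4, 0.1, 0.1, 0.4],   # state 0 favors A/T
                 [0.1, 0.4, 0.4, 0.1]])  # state 1 favors C/G
print(forward([0, 3, 1, 2], init, trans, emit))  # P(observing "ATCG")
```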
We introduce ApHMM, the first flexible acceleration framework designed to
significantly reduce both computational and energy overheads associated with
the Baum-Welch algorithm for pHMMs. ApHMM tackles the major inefficiencies in
the Baum-Welch algorithm by 1) designing flexible hardware to accommodate
various pHMM designs, 2) exploiting predictable data dependency patterns
through on-chip memory with memoization techniques, 3) rapidly filtering out
negligible computations using a hardware-based filter, and 4) minimizing
redundant computations.
ApHMM achieves substantial speedups of 15.55x - 260.03x, 1.83x - 5.34x, and
27.97x when compared to CPU, GPU, and FPGA implementations of the Baum-Welch
algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations
in three key bioinformatics applications: 1) error correction, 2) protein
family search, and 3) multiple sequence alignment, by 1.29x - 59.94x, 1.03x -
1.75x, and 1.03x - 1.95x, respectively, while improving their energy efficiency
by 64.24x - 115.46x, 1.75x, and 1.96x, respectively.
Comment: Accepted to ACM TACO.
Reducing the environmental impact of surgery on a global scale: systematic review and co-prioritization with healthcare workers in 132 countries
Abstract
Background
Healthcare cannot achieve net-zero carbon without addressing operating theatres. The aim of this study was to prioritize feasible interventions to reduce the environmental impact of operating theatres.
Methods
This study adopted a four-phase Delphi consensus co-prioritization methodology. In phase 1, a systematic review of published interventions and global consultation of perioperative healthcare professionals were used to longlist interventions. In phase 2, iterative thematic analysis consolidated comparable interventions into a shortlist. In phase 3, the shortlist was co-prioritized based on patient and clinician views on acceptability, feasibility, and safety. In phase 4, ranked lists of interventions were presented by their relevance to high-income countries and low–middle-income countries.
Results
In phase 1, 43 interventions were identified, which had low uptake in practice according to 3042 professionals globally. In phase 2, a shortlist of 15 intervention domains was generated. In phase 3, interventions were deemed acceptable for more than 90 per cent of patients except for reducing general anaesthesia (84 per cent) and re-sterilization of ‘single-use’ consumables (86 per cent). In phase 4, the top three shortlisted interventions for high-income countries were: introducing recycling; reducing use of anaesthetic gases; and appropriate clinical waste processing. In phase 4, the top three shortlisted interventions for low–middle-income countries were: introducing reusable surgical devices; reducing use of consumables; and reducing the use of general anaesthesia.
Conclusion
This is a step toward environmentally sustainable operating environments, with
actionable interventions applicable to both high-income and low–middle-income
countries.
SequenceLab Datasets
These are the datasets included in the SequenceLab evaluation framework. They
were generated based on read sets of the human genome with accession numbers
SRR10035390 (Illumina), SRR12519035 (PacBio HiFi), and SRR12564436 (Oxford
Nanopore Technologies).
The files are in TSS (Tab Separated Sequences) format. Each line contains a
pair of nucleotide sequences, separated by a tab. This simplified format
enables evaluating genomic tools with little overhead on real datasets.
TSS Specification
- Each line consists of a pair of nucleotide sequences, separated by a tab
character.
- Each line is terminated by a single newline character, i.e., in UNIX style.
Windows-style line breaks (carriage return + newline) are not permitted.
- Sequences may consist of uppercase and lowercase nucleic or amino acid codes,
as allowed in the FASTA format (https://zhanggroup.org/FASTA/).
- If the dataset is for a read-mapping use case, the first sequence is the read
or query and the second is the reference or target.
Methodology
The datasets were generated in three steps:
1. Each read set was mapped to the T2T CHM13 reference genome using minimap2,
once with alignment disabled and once with alignment enabled, resulting in the
*_chained and *_mapped datasets, respectively.
2. The candidate pairs reported in the resulting .paf files were extracted from
the reads and the reference and written to a .tss file.
3. For each .tss file, the shortest 90% and longest 10% of candidate locations
were split into separate .tss files named *_bottom and *_top, respectively.
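Given the specification above, a TSS file can be consumed with a few lines of code. The parser below is an illustrative sketch (the file name in the usage comment is a placeholder), not part of the SequenceLab release.

```python
def read_tss(path):
    """Yield (query, target) sequence pairs from a Tab Separated Sequences file.

    Per the specification above: one pair per line, tab-separated, UNIX
    newlines, FASTA-style nucleic or amino acid codes; for read-mapping
    datasets the first field is the read/query and the second the
    reference/target.
    """
    with open(path, newline="\n") as handle:   # keep any stray "\r" visible
        for line_number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2:
                raise ValueError(f"line {line_number}: expected exactly 2 "
                                 "tab-separated sequences")
            yield fields[0], fields[1]

# Example usage (hypothetical file name):
# for query, target in read_tss("illumina_mapped_bottom.tss"):
#     ...  # feed the pair to the sequence-comparison method under test
```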
MetaFast: Enabling Fast Metagenomic Classification via Seed Counting and Edit Distance Approximation
Metagenomics, the study of genome sequences of diverse organisms cohabiting in
a shared environment, has experienced significant advancements across various
medical and biological fields. Metagenomic analysis is crucial, for instance,
in clinical applications such as infectious disease screening and the diagnosis
and early detection of diseases such as cancer. A key task in metagenomics is
to determine the species present in a sample and their relative abundances.
Currently, the field is dominated by either alignment-based tools, which offer
high accuracy but are computationally expensive, or alignment-free tools, which
are fast but lack the needed accuracy for many applications. In response to
this dichotomy, we introduce MetaFast, a tool based on heuristics, to achieve a
fundamental improvement in the accuracy-runtime tradeoff over existing methods.
MetaFast delivers accuracy comparable to the alignment-based and highly
accurate tool Metalign but with significantly enhanced efficiency. In MetaFast,
we accelerate memory-frugal reference database indexing and filtering. We
further employ heuristics to accelerate read mapping. Our evaluation
demonstrates that MetaFast achieves a 4x speedup over Metalign without
compromising accuracy. MetaFast is publicly available at
https://github.com/CMU-SAFARI/MetaFast
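The abstract names seed counting and edit distance approximation as the core heuristics but does not spell them out, so the sketch below only illustrates the general seed-counting idea used by many alignment-free filters: count how many short k-mer seeds a read shares with each candidate reference and keep only candidates whose count clears a threshold, reserving (approximate) edit-distance computation for the survivors. The k-mer length and threshold are illustrative assumptions, not MetaFast's parameters.

```python
def kmers(sequence, k=21):
    """All distinct overlapping k-mers of a sequence (illustrative k)."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def seed_count_filter(read, references, k=21, min_shared_seeds=5):
    """Keep only candidate references that share enough seeds with the read.

    Shared-seed counts are a cheap proxy for similarity, so only the
    surviving candidates need an (approximate) edit-distance evaluation.
    `references` maps reference names to their sequences."""
    read_seeds = kmers(read, k)
    counts = {name: len(read_seeds & kmers(ref, k))
              for name, ref in references.items()}
    return {name: n for name, n in counts.items() if n >= min_shared_seeds}
```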