Hardware acceleration of the pair HMM algorithm for DNA variant calling
With the advent of several accurate and sophisticated statistical algorithms and pipelines for DNA sequence analysis, it is becoming increasingly possible to translate raw sequencing data into biologically meaningful information for further clinical analysis and processing. However, given the large volume of the data involved, even modestly complex algorithms would require a prohibitively long time to complete. Hence it is urgent to explore non-conventional implementation platforms to accelerate genomics research.
In this thesis, we present a Field-Programmable Gate Array (FPGA) accelerated implementation of the Pair Hidden Markov Model (Pair HMM) forward algorithm, the performance bottleneck in the HaplotypeCaller, a critical function in the popular Genome Analysis Toolkit (GATK) variant calling tool. We introduce the PE ring structure which, thanks to the fine-grained parallelism allowed by the FPGA, can be built into various configurations that strike a trade-off between Instruction-Level Parallelism (ILP) and data parallelism. We investigate the resource utilization and performance of different configurations. Our solution can achieve a speed-up of up to 487x compared to the C++ baseline implementation on CPU and 1.56x compared to the previous best hardware implementation.
Exploration of GPU acceleration for pair-HMM algorithm and its application in the DNA alignment problem
The hidden Markov model, known as HMM, is an important type of
statistical model with extensive application in estimating hidden parameters and
decoding observed Markov chains.
Building on the HMM, the Pair-HMM algorithm used in the HaplotypeCaller is a
popular solution to the DNA alignment problem. For two aligned sequences of
DNA observations, one called the reference and the other the read, there are
only three possible hidden states: match (A , A), insertion (- , A), and
deletion (A , -). However, what we can observe through DNA sequencing in
practice is the summation of the possibilities for match, insertion, and
deletion as macrostates. To determine the alignment with maximum probability,
we need to score each possible pairwise alignment, which leads to a
computationally intensive problem that usually contributes the most latency
in variant calling with the GATK HaplotypeCaller.
In the CPU implementation of a proper Pair-HMM forward algorithm, there
are 7 multiply-accumulate operations for each ( i , j ) location on the
read-reference matrix. Moreover, since the transition and emission matrices
are fixed throughout a single alignment process, a CUDA implementation with
single-precision floating-point is proposed to accelerate the Pair-HMM
forward algorithm.
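The per-cell recurrence described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the thesis's CPU or CUDA code: the function name `pair_hmm_forward`, the constant parameters `p_match`, `gap_open`, and `gap_ext`, the uniform start over reference positions, and the emission-free insertion state are all hypothetical simplifications (GATK derives its probabilities from per-base qualities). One plausible reading of the 7 multiply-accumulate operations is three for the match state plus two each for insertion and deletion, which is how the inner loop below is written.

```python
def pair_hmm_forward(read, ref, p_match=0.99, gap_open=0.01, gap_ext=0.1):
    """Sum the probability of `read` over all alignments to `ref` (sketch)."""
    if not read or not ref:
        raise ValueError("read and ref must be non-empty")
    n, m = len(read), len(ref)

    # Hypothetical constant transition probabilities.
    t_mm = 1.0 - 2.0 * gap_open      # match -> match
    t_mi = t_md = gap_open           # match -> insertion / deletion
    t_im = t_dm = 1.0 - gap_ext      # insertion / deletion -> match
    t_ii = t_dd = gap_ext            # gap extension

    # One (n+1) x (m+1) matrix per hidden state.
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    I = [[0.0] * (m + 1) for _ in range(n + 1)]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]

    # Assumed initialization: the read may start anywhere on the reference.
    for j in range(m + 1):
        D[0][j] = 1.0 / m

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = read[i - 1] == ref[j - 1]
            emit = p_match if same else (1.0 - p_match) / 3.0
            # Three multiply-accumulates for the match state ...
            M[i][j] = emit * (t_mm * M[i - 1][j - 1]
                              + t_im * I[i - 1][j - 1]
                              + t_dm * D[i - 1][j - 1])
            # ... two for insertion (consumes a read base only) ...
            I[i][j] = t_mi * M[i - 1][j] + t_ii * I[i - 1][j]
            # ... and two for deletion (consumes a reference base only).
            D[i][j] = t_md * M[i][j - 1] + t_dd * D[i][j - 1]

    # The read may end at any reference position.
    return sum(M[n][j] + I[n][j] for j in range(1, m + 1))
```

A call such as `pair_hmm_forward("ACGT", "ACGT")` scores a read against a candidate reference; a read that matches the reference should score higher than one that does not.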
The CUDA implementation with minibatch and state parallelization, along with
the use of float32, gives roughly a 22.6x speedup over the CPU
implementation. This comes at a price, however: using single-precision
instead of double-precision floating-point introduces a more serious
underflow problem at the beginning of the alignment scoring process. A
normalization technique is used to help fix this problem.
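The abstract does not say which normalization is used, so the following is only a generic illustration of the idea on an ordinary HMM, not the author's method. It rescales the forward vector at every step and accumulates the logarithms of the scale factors. The names `forward_naive`, `forward_scaled`, `INIT`, `TRANS`, and `EMIT` and the two-state model are hypothetical. Python floats are double-precision, so the underflow here appears only on long inputs; with float32 it would appear much sooner.

```python
import math

def forward_naive(init, trans, emit, obs):
    """Plain forward algorithm; the result underflows to 0.0 on long inputs."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [emit[s][o] * sum(alpha[r] * trans[r][s] for r in range(n))
                 for s in range(n)]
    return sum(alpha)

def forward_scaled(init, trans, emit, obs):
    """Forward algorithm with per-step normalization; returns log P(obs)."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [emit[s][obs[t]]
                     * sum(alpha[r] * trans[r][s] for r in range(n))
                     for s in range(n)]
        # Rescale so the vector sums to 1 and remember the factor in log space.
        scale = sum(alpha)
        alpha = [a / scale for a in alpha]
        log_lik += math.log(scale)
    return log_lik

# Hypothetical two-state, two-symbol model used only for illustration.
INIT = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.2, 0.8]]
EMIT = [[0.7, 0.3], [0.4, 0.6]]
```

On short inputs the two functions agree (after exponentiating the scaled result); on long inputs only the scaled version returns a usable value.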
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis
Profile hidden Markov models (pHMMs) are widely employed in various
bioinformatics applications to identify similarities between biological
sequences, such as DNA or protein sequences. In pHMMs, sequences are
represented as graph structures whose states and edges are assigned
probabilities. These probabilities are subsequently used to
compute the similarity score between a sequence and a pHMM graph. The
Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these
probabilities to optimize and compute similarity scores. However, the
Baum-Welch algorithm is computationally intensive, and existing solutions offer
either software-only or hardware-only approaches with fixed pHMM designs. We
identify an urgent need for a flexible, high-performance, and energy-efficient
HW/SW co-design to address the major inefficiencies in the Baum-Welch algorithm
for pHMMs.
We introduce ApHMM, the first flexible acceleration framework designed to
significantly reduce both computational and energy overheads associated with
the Baum-Welch algorithm for pHMMs. ApHMM tackles the major inefficiencies in
the Baum-Welch algorithm by 1) designing flexible hardware to accommodate
various pHMM designs, 2) exploiting predictable data dependency patterns
through on-chip memory with memoization techniques, 3) rapidly filtering out
negligible computations using a hardware-based filter, and 4) minimizing
redundant computations.
ApHMM achieves substantial speedups of 15.55x - 260.03x, 1.83x - 5.34x, and
27.97x when compared to CPU, GPU, and FPGA implementations of the Baum-Welch
algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations
in three key bioinformatics applications: 1) error correction, 2) protein
family search, and 3) multiple sequence alignment, by 1.29x - 59.94x, 1.03x -
1.75x, and 1.03x - 1.95x, respectively, while improving their energy efficiency
by 64.24x - 115.46x, 1.75x, 1.96x.
Comment: Accepted to ACM TAC
Spatial intratumoral heterogeneity and temporal clonal evolution in esophageal squamous cell carcinoma.
Esophageal squamous cell carcinoma (ESCC) is among the most common malignancies, but little is known about its spatial intratumoral heterogeneity (ITH) and temporal clonal evolutionary processes. To address this, we performed multiregion whole-exome sequencing on 51 tumor regions from 13 ESCC cases and multiregion global methylation profiling for 3 of these 13 cases. We found an average of 35.8% heterogeneous somatic mutations with strong evidence of ITH. Half of the driver mutations located on the branches of tumor phylogenetic trees targeted oncogenes, including PIK3CA, NFE2L2 and MTOR, among others. By contrast, the majority of truncal and clonal driver mutations occurred in tumor-suppressor genes, including TP53, KMT2D and ZNF750, among others. Interestingly, phyloepigenetic trees robustly recapitulated the topological structures of the phylogenetic trees, indicating a possible relationship between genetic and epigenetic alterations. Our integrated investigations of spatial ITH and clonal evolution provide an important molecular foundation for enhanced understanding of tumorigenesis and progression in ESCC.
Genome variation over multiple timescales and dimensions
Genomic variation includes not only nucleotide changes but also changes in DNA shape, structure, epigenetic marks, and expression, all of which can occur over generations, during cellular differentiation, or over the span of a few hours or a few millennia. This doctoral thesis explores the implications and opportunities presented by these multiple forms of genomic variation for genome editing, cellular differentiation, genome regulation and comparative genomics, all towards improving our understanding of genome evolution and development and benefiting human health.
Decomposing Genomics Algorithms: Core Computations for Accelerating Genomics
Technological advances in genomic analyses and computing sciences have led to a burst in genomics data. With those advances, there has also been parallel growth in dedicated accelerators for specific genomic analyses. However, biologists need a reconfigurable machine that allows them to perform multiple analyses without resorting to a dedicated compute platform for each analysis. This work addresses the first steps in the design of such a reconfigurable machine. We hypothesize that this machine design can consist of accelerators for computations common across various genomic analyses. This work studies a subset of genomic analyses and identifies such core computations. We also investigate the possibility of further acceleration through a deeper analysis of the computation primitives.
National Science Foundation (NSF CNS 13-37732); Infosys; IBM Faculty Award; Office of the Vice Chancellor for Research, University of Illinois at Urbana-Champaign
Hidden Markov Models and their Applications in Biological Sequence Analysis
Hidden Markov models (HMMs) have been extensively used in biological sequence analysis. In this paper, we give a tutorial review of HMMs and their applications in a variety of problems in molecular biology. We especially focus on three types of HMMs: the profile-HMMs, pair-HMMs, and context-sensitive HMMs. We show how these HMMs can be used to solve various sequence analysis problems, such as pairwise and multiple sequence alignments, gene annotation, classification, similarity search, and many others.
Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions
Nanopore sequencing technology has the potential to render other sequencing
technologies obsolete with its ability to generate long reads and provide
portability. However, high error rates of the technology pose a challenge while
generating accurate genome assemblies. The tools used for nanopore sequence
analysis are of critical importance as they should overcome the high error
rates of the technology. Our goal in this work is to comprehensively analyze
current publicly available tools for nanopore sequence analysis to understand
their advantages, disadvantages, and performance bottlenecks. It is important
to understand where the current tools do not perform well to develop better
tools. To this end, we 1) analyze the multiple steps and the associated tools
in the genome assembly pipeline using nanopore sequence data, and 2) provide
guidelines for determining the appropriate tools for each step. We analyze
various combinations of different tools and expose the tradeoffs between
accuracy, performance, memory usage and scalability. We conclude that our
observations can guide researchers and practitioners in making conscious and
effective choices for each step of the genome assembly pipeline using nanopore
sequence data. Also, with the help of bottlenecks we have found, developers can
improve the current tools or build new ones that are both accurate and fast, in
order to overcome the high error rates of the nanopore sequencing technology.
Comment: To appear in Briefings in Bioinformatics (BIB), 201