142 research outputs found
Hardware acceleration of the pair HMM algorithm for DNA variant calling
With the advent of several accurate and sophisticated statistical algorithms and pipelines for DNA sequence analysis, it is becoming increasingly possible to translate raw sequencing data into biologically meaningful information for further clinical analysis and processing. However, given the large volume of the data involved, even modestly complex algorithms would require a prohibitively long time to complete. Hence it is urgent to explore non-conventional implementation platforms to accelerate genomics research.
In this thesis, we present a Field-Programmable Gate Array (FPGA) accelerated implementation of the Pair Hidden Markov Model (Pair HMM) forward algorithm, the performance bottleneck in the HaplotypeCaller, a critical function in the popular Genome Analysis Toolkit (GATK) variant calling tool. We introduce the processing element (PE) ring structure which, thanks to the fine-grained parallelism allowed by the FPGA, can be built into various configurations that trade off Instruction-Level Parallelism (ILP) against data parallelism. We investigate the resource utilization and performance of different configurations. Our solution can achieve a speed-up of up to 487x compared to the C++ baseline implementation on CPU and 1.56x compared to the previous best hardware implementation.
Exploration of GPU acceleration for pair-HMM algorithm and its application in the DNA alignment problem
The hidden Markov model, known as HMM, is an important type of
statistical model with extensive application in estimating hidden parameters and
decoding observed Markov chains.
Built on the HMM, the Pair-HMM algorithm used in the HaplotypeCaller is a
popular solution to the DNA alignment problem. For two aligned sequences of
DNA observations, one called the reference and the other called the read,
there are only three possible hidden states: match (A, A), insertion (-, A),
and deletion (A, -). However, what DNA sequencing lets us observe in practice
is the summation of the possibilities for match, insertion, and deletion as
macrostates. To determine the alignment with maximum probability, we need to
score each possible pairwise alignment, which leads to a computationally
intensive problem that usually contributes the most latency in variant calling
with the GATK HaplotypeCaller.
In the CPU implementation of a proper Pair-HMM forward algorithm, there are 7
multiply-accumulate operations for each (i, j) location on the read-reference
matrix. Moreover, since the transition and emission matrices are fixed
throughout a single alignment process, a CUDA implementation with
single-precision floating-point is proposed to accelerate the Pair-HMM forward
algorithm.
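As an illustration of the forward recurrence over the read-reference matrix, here is a minimal Python sketch. It is not the thesis implementation: the function name `pair_hmm_forward`, the constant transition probabilities (`gap_open`, `gap_ext`), the fixed base error rate `p_err`, and the uniform start along the reference are all assumptions chosen for brevity, and both sequences are assumed non-empty.

```python
def pair_hmm_forward(read, ref, p_err=0.01, gap_open=0.001, gap_ext=0.1):
    """Sum the probability of `read` over all alignments to `ref`.

    Hypothetical sketch: parameters are constants here, whereas real
    variant callers derive them from per-base quality scores.
    """
    R, H = len(read), len(ref)
    t_mm = 1.0 - 2.0 * gap_open      # match -> match
    t_mi = t_md = gap_open           # match -> insertion / deletion
    t_ii = t_dd = gap_ext            # gap extension
    t_im = t_dm = 1.0 - gap_ext      # gap -> match

    # One (R+1) x (H+1) matrix per hidden state.
    M = [[0.0] * (H + 1) for _ in range(R + 1)]
    I = [[0.0] * (H + 1) for _ in range(R + 1)]
    D = [[0.0] * (H + 1) for _ in range(R + 1)]
    for j in range(H + 1):
        D[0][j] = 1.0 / H            # assumed: read may start anywhere

    for i in range(1, R + 1):
        for j in range(1, H + 1):
            # Emission: agreement is likely, a mismatch costs p_err / 3.
            emit = 1.0 - p_err if read[i - 1] == ref[j - 1] else p_err / 3.0
            M[i][j] = emit * (t_mm * M[i - 1][j - 1]
                              + t_im * I[i - 1][j - 1]
                              + t_dm * D[i - 1][j - 1])
            I[i][j] = t_mi * M[i - 1][j] + t_ii * I[i - 1][j]
            D[i][j] = t_md * M[i][j - 1] + t_dd * D[i][j - 1]

    # The read may end at any reference position.
    return sum(M[R][j] + I[R][j] for j in range(1, H + 1))


if __name__ == "__main__":
    print(pair_hmm_forward("ACGT", "ACGT"))
    print(pair_hmm_forward("ACGT", "TTTT"))
```

Each cell depends only on its upper, left, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another, which is the kind of structure a parallel implementation can exploit.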
The CUDA implementation with minibatching and state parallelization, along
with the use of float32, gives a speedup of around 22.6x over the CPU
implementation. This comes at a price, however: using single-precision instead
of double-precision floating-point introduces a more serious underflow problem
at the beginning of the alignment scoring process. A normalization technique is
used to help fix this problem.
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis
Profile hidden Markov models (pHMMs) are widely employed in various
bioinformatics applications to identify similarities between biological
sequences, such as DNA or protein sequences. In pHMMs, sequences are
represented as graph structures whose states and edges are assigned
probabilities. These probabilities are subsequently used to
compute the similarity score between a sequence and a pHMM graph. The
Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these
probabilities to optimize and compute similarity scores. However, the
Baum-Welch algorithm is computationally intensive, and existing solutions offer
either software-only or hardware-only approaches with fixed pHMM designs. We
identify an urgent need for a flexible, high-performance, and energy-efficient
HW/SW co-design to address the major inefficiencies in the Baum-Welch algorithm
for pHMMs.
We introduce ApHMM, the first flexible acceleration framework designed to
significantly reduce both computational and energy overheads associated with
the Baum-Welch algorithm for pHMMs. ApHMM tackles the major inefficiencies in
the Baum-Welch algorithm by 1) designing flexible hardware to accommodate
various pHMM designs, 2) exploiting predictable data dependency patterns
through on-chip memory with memoization techniques, 3) rapidly filtering out
negligible computations using a hardware-based filter, and 4) minimizing
redundant computations.
ApHMM achieves substantial speedups of 15.55x - 260.03x, 1.83x - 5.34x, and
27.97x when compared to CPU, GPU, and FPGA implementations of the Baum-Welch
algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations
in three key bioinformatics applications: 1) error correction, 2) protein
family search, and 3) multiple sequence alignment, by 1.29x - 59.94x, 1.03x -
1.75x, and 1.03x - 1.95x, respectively, while improving their energy efficiency
by 64.24x - 115.46x, 1.75x, and 1.96x.
Comment: Accepted to ACM TAC
Decomposing Genomics Algorithms: Core Computations for Accelerating Genomics
Technological advances in genomic analyses and computing sciences have led to a burst in genomics data. With those advances, there has also been parallel growth in dedicated accelerators for specific genomic analyses. However, biologists need a reconfigurable machine that lets them perform multiple analyses without resorting to a dedicated compute platform for each analysis. This work addresses the first steps in the design of such a reconfigurable machine. We hypothesize that this machine design can consist of accelerators for computations common across various genomic analyses. This work studies a subset of genomic analyses and identifies such core computations. We also investigate the possibility of further acceleration through a deeper analysis of the computation primitives.
National Science Foundation (NSF CNS 13-37732); Infosys; IBM Faculty Award; Office of the Vice Chancellor for Research, University of Illinois at Urbana-Champaign
Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions
Nanopore sequencing technology has the potential to render other sequencing
technologies obsolete with its ability to generate long reads and provide
portability. However, high error rates of the technology pose a challenge while
generating accurate genome assemblies. The tools used for nanopore sequence
analysis are of critical importance as they should overcome the high error
rates of the technology. Our goal in this work is to comprehensively analyze
current publicly available tools for nanopore sequence analysis to understand
their advantages, disadvantages, and performance bottlenecks. It is important
to understand where the current tools do not perform well to develop better
tools. To this end, we 1) analyze the multiple steps and the associated tools
in the genome assembly pipeline using nanopore sequence data, and 2) provide
guidelines for determining the appropriate tools for each step. We analyze
various combinations of different tools and expose the tradeoffs between
accuracy, performance, memory usage and scalability. We conclude that our
observations can guide researchers and practitioners in making conscious and
effective choices for each step of the genome assembly pipeline using nanopore
sequence data. Also, with the help of bottlenecks we have found, developers can
improve the current tools or build new ones that are both accurate and fast, in
order to overcome the high error rates of the nanopore sequencing technology.
Comment: To appear in Briefings in Bioinformatics (BIB), 201
Bioinformatics
This book is divided into research areas relevant to bioinformatics, such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics, and intelligent data analysis. Each book section introduces the basic concepts and then explains their application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here.
Skaalautuvat laskentamenetelmät suuren kapasiteetin sekvensointidatan analytiikkaan populaatiogenomiikassa (Scalable computing methods for high-throughput sequencing data analytics in population genomics)
High-throughput sequencing (HTS) technologies have enabled rapid DNA sequencing of whole genomes collected from various organisms and environments, including human tissues, plants, soil, water, and air. As a result, sequencing data volumes have grown by several orders of magnitude, and the number of assembled whole genomes is increasing rapidly as well. This whole-genome sequencing (WGS) data has revealed the genetic variation in humans and other species, and advanced various fields from human and microbial genomics to drug design and personalized medicine. The amount of sequencing data has almost doubled every six months, creating new possibilities but also big data challenges in genomics. Diverse methods used in modern computational biology require a vast amount of computational power, and advances in HTS technology are further widening the gap between the analysis input data and the analysis outcome.
Currently, many of the existing genomic analysis tools, algorithms, and pipelines are not fully exploiting the power of distributed and high-performance computing, which in turn limits the analysis throughput and restrains the deployment of the applications to clinical practice in the long run. Thus, the relevance of harnessing distributed and cloud computing in bioinformatics is more significant than ever before. Besides, efficient data compression and storage methods for genomic data processing and retrieval integrated with conventional bioinformatics tools are essential. These vast datasets have to be stored and structured in formats that can be managed, processed, searched, and analyzed efficiently in distributed systems.
Genomic data contain repetitive sequences, which is one key property in developing efficient compression algorithms to alleviate the data storage burden. Moreover, indexing compressed sequences appropriately for bioinformatics tools, such as read aligners, offers direct sequence search and alignment capabilities with compressed indexes. Relative Lempel-Ziv (RLZ) has been found to be an efficient compression method for repetitive genomes that complies with the data-parallel computing approach. RLZ has recently been used to build hybrid-indexes compatible with read aligners, and we focus on extending it with distributed computing. Data structures found in genomic data formats have properties suitable for parallelizing routine bioinformatics methods, e.g., sequence matching, read alignment, genome assembly, genotype imputation, and variant calling. Compressed indexing fused with the routine bioinformatics methods and data-parallel computing seems a promising approach to building population-scale genome analysis pipelines. Various data decomposition and transformation strategies are studied for optimizing data-parallel computing performance when such routine bioinformatics methods are executed in a complex pipeline. These novel distributed methods are studied in this dissertation and demonstrated in a generalized scalable bioinformatics analysis pipeline design.
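To make the Relative Lempel-Ziv idea concrete, the following Python sketch parses a target sequence into phrases that each point into a reference sequence. It is a naive illustration rather than the dissertation's method: the function names `rlz_compress` and `rlz_decompress` are hypothetical, and the longest match is found by repeated substring search, whereas practical implementations typically use a suffix array or FM-index built over the reference.

```python
def rlz_compress(reference, target):
    """Greedy RLZ parse of `target` relative to `reference`.

    Each phrase is either a (position, length) pair pointing into the
    reference or a single literal character that the reference lacks.
    """
    phrases = []
    i = 0
    while i < len(target):
        pos, length = -1, 0
        # Grow the phrase while the longer prefix still occurs somewhere
        # in the reference (naive search, quadratic in the worst case).
        while i + length < len(target):
            found = reference.find(target[i:i + length + 1])
            if found < 0:
                break
            pos, length = found, length + 1
        if length == 0:
            phrases.append(target[i])          # literal fallback
            i += 1
        else:
            phrases.append((pos, length))
            i += length
    return phrases


def rlz_decompress(reference, phrases):
    """Rebuild the target by copying each phrase out of the reference."""
    parts = []
    for phrase in phrases:
        if isinstance(phrase, str):
            parts.append(phrase)
        else:
            pos, length = phrase
            parts.append(reference[pos:pos + length])
    return "".join(parts)


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"
    target = "ACGTTTGANACGTACG"
    phrases = rlz_compress(reference, target)
    print(phrases)
    print(rlz_decompress(reference, phrases) == target)
```

Because every target sequence is parsed against the same fixed reference, different targets (or chunks of one target) can be compressed independently, which is one way to read the abstract's remark that RLZ fits a data-parallel approach.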
The dissertation starts from the main concepts of genomics and DNA sequencing technologies and builds routine bioinformatics methods on the principles of distributed and parallel computing. This dissertation advances towards designing fully distributed and scalable bioinformatics pipelines focusing on population genomic problems where the input data sets are vast and the analysis results are hard to achieve with conventional computing. Finally, the methods studied are applied in scalable population genomics applications using real WGS data and tested on a high-performance computing cluster. The experiments include mining virus sequences from human metagenomes, imputing genotypes from large-scale human populations, sequence alignment with compressed pan-genomic indexes, and assembling reference genomes for pan-genomic variant calling.
High-throughput sequencing (HTS) methods have enabled fast and inexpensive sequencing of whole genomes from various organisms and environments, including tissue, soil, water, and air samples. As a result, the volumes of sequencing data and of assembled whole genomes have grown rapidly. Whole-genome sequencing has increased our knowledge of the genetic makeup of humans and other species and advanced fields ranging from environmental science to drug design and personalized medicine. The amount of sequencing data has almost doubled every six months, which has created new opportunities for breakthroughs but also major data-processing challenges. The complex analysis methods used in modern computational biology require ever more computing power as HTS data grows, and the progress of HTS methods therefore widens the gap between raw data and final analysis results.
Many of the genome analysis tools, algorithms, and software packages in use today do not fully exploit the power of distributed computing, which in turn slows the delivery of new analysis results and limits the adoption of scientific software in clinical medicine in the long run. Consequently, harnessing distributed and cloud computing in bioinformatics is more important than ever before. Compression and storage methods that support direct search and processing of genomic data enable fast and space-efficient genome analytics. New data structures suited to distributed systems are needed so that these large volumes of data can be managed, processed, searched, and analyzed efficiently.
Genomic data contains many repetitive sequences, a key property when developing efficient compression algorithms to lighten the burden of data storage and analysis. In addition, indexing compressed sequences in combination with sequence alignment methods enables random access to sequences and direct alignment against compressed sequences. The Relative Lempel-Ziv (RLZ) compression method has been found to be efficient for repetitive genomic sequences when combined with parallel computing. RLZ has recently been applied to hybrid indexes compatible with sequence alignment, which this work accelerates with distributed computing. The data structures found in genomic data formats have properties suited to distributed sequence search, sequence alignment, genome assembly, genotype imputation, and variant detection. Compressed indexing applied to methods accelerated by distributed computing appears to be a promising approach for adapting population genomics analysis software to large data volumes. Various data partitioning and transformation strategies are used to improve performance in multi-stage distributed processing of genomic data. These new scalable distributed computing methods are studied in this dissertation and demonstrated with a general-purpose architecture for bioinformatics analysis software.
This work introduces the basic concepts of genomics and DNA sequencing technologies and presents routine bioinformatics methods based on the principles of distributed and parallel computing. The dissertation advances toward the design of fully distributed and scalable bioinformatics software, focusing on population genomics problems in which the input data volumes are large and analysis results are slow or even impossible to obtain with conventional computing. Finally, the methods studied are applied to the scalable population genomics applications developed in this work, which are tested with whole-genome data on the compute cluster of a supercomputer. The experiments include mining virus sequences from human metagenome samples, imputing genotypes from large human populations, and compressing a pan-genomic index to speed up sequence alignment. In addition, the compressed pan-genome is tested for assembling a reference genome for population-based variant detection.
- …