183 research outputs found

    Efficient Storage of Genomic Sequences in High Performance Computing Systems

    ABSTRACT: In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequencing data stored in FASTQ files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors offer only limited performance for genomic data, hence the need for specialized compression solutions. Two approaches have emerged to exploit the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for compressing raw genomic data, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims to improve the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios compared to the best state-of-the-art compressors, while showing reasonable execution times and memory use. In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture in order to develop a parallel version of the UdeACompress algorithms that reduces the runtime. Through SIMD programming, we significantly accelerated the main bottleneck found in UdeACompress, the suffix array construction.
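The idea behind referential compression in the abstract above can be illustrated with a toy example: a read that aligns to a known reference is stored as an alignment position plus the bases that differ, instead of the full sequence. The sketch below is not UdeACompress's encoding (the abstract does not describe its binary format). The function names are hypothetical, the alignment position is assumed to be given, and only substitutions are handled.

```python
def encode_read(read, reference, position):
    """Encode a read as (position, length, mismatches) against a reference.

    Hypothetical helper: assumes the alignment position is already known
    and that the read differs from the reference by substitutions only.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit in the reference at this position")
    mismatches = [
        (offset, base)
        for offset, (base, ref_base) in enumerate(zip(read, window))
        if base != ref_base
    ]
    return position, len(read), mismatches


def decode_read(encoded, reference):
    """Rebuild the original read from its encoding and the same reference."""
    position, length, mismatches = encoded
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGGCTA"
read = "TAGCCGTTAG"  # differs from reference[8:18] at one base
encoded = encode_read(read, reference, 8)
decoded = decode_read(encoded, reference)
```

Here the ten-base read is represented by one position, one length, and a single mismatch. The saving grows with read length and with similarity to the reference, which is consistent with the abstract's point that effectiveness depends on choosing a good reference.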

    BRIDES: A New Fast Algorithm and Software for Characterizing Evolving Similarity Networks Using Breakthroughs, Roadblocks, Impasses, Detours, Equals and Shortcuts

    Various types of genome and gene similarity networks, along with their characteristics, have been increasingly used for retracing different kinds of evolutionary and ecological relationships. Here, we present a new polynomial-time algorithm and the corresponding software (BRIDES) to characterize the different types of paths existing in evolving (or augmented) similarity networks, under the constraint that such paths contain at least one node that was not present in the original network. These paths are denoted Breakthroughs, Roadblocks, Impasses, Detours, Equal paths, and Shortcuts. Analyzing their distribution can help discriminate among different evolutionary hypotheses concerning the genomes or genes at hand. Our approach is based on an original application of Dijkstra's and Yen's popular shortest-path algorithms.
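As a rough illustration of the kind of comparison described above, the following sketch runs Dijkstra's algorithm on a small original network and on an augmented one, then checks whether the shortest path in the augmented network passes through an added node. The graphs and function names are made up for the example. The abstract does not say how BRIDES maps such comparisons to its six path categories, so no category is assigned here, and the sketch only checks for an added node rather than enforcing the constraint.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a dict-of-dicts graph with non-negative weights.

    Returns (distance, path); (inf, []) if the target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


# Toy undirected networks (hypothetical data): a chain a-b-c-d, then the
# same chain augmented with a new node x linked to both a and d.
original = {
    "a": {"b": 1}, "b": {"a": 1, "c": 1},
    "c": {"b": 1, "d": 1}, "d": {"c": 1},
}
augmented = {
    "a": {"b": 1, "x": 1}, "b": {"a": 1, "c": 1},
    "c": {"b": 1, "d": 1}, "d": {"c": 1, "x": 1},
    "x": {"a": 1, "d": 1},
}

dist_original, path_original = shortest_path(original, "a", "d")
dist_augmented, path_augmented = shortest_path(augmented, "a", "d")
added_nodes = set(augmented) - set(original)
uses_added_node = any(node in added_nodes for node in path_augmented)
```

Comparing `dist_original` with `dist_augmented`, together with `uses_added_node`, gives the raw ingredients that a path classification of this kind would need.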

    Efficient Algorithms for Prokaryotic Whole Genome Assembly and Finishing

    De novo genome assembly from DNA fragments is primarily based on sequence overlap information. In addition, mate-pair or paired-end reads provide linking information for joining gaps and bridging repeat regions. Genome assemblers generally assemble long contiguous sequences (contigs) using both overlapping reads and linked reads until the assembly runs into an ambiguous repeat region. These contigs are then bridged into scaffolds using linked-read information. However, errors can be made in both phases of assembly, due to a high error threshold for overlap acceptance and to linking based on too few mate reads. Identical as well as similar repeat regions can often cause errors in overlap and mate-pair evidence. In addition, setting the correct threshold to minimize errors and optimize the assembly of reads is not trivial and often requires a time-consuming trial-and-error process to obtain optimal results. The typical trial-and-error approach with multiple assemblers can be computationally intensive and is very inefficient, especially when users must learn how to use a wide variety of assemblers, many of which may be serial, require long execution times, and may not return usable or accurate results. Further, we show that comparing assembly results may not provide users with a clear winner under all circumstances. Therefore, we propose a novel scaffolding tool, Correlative Algorithm for Repeat Placement (CARP), capable of joining short, low-error contigs using mate-pair reads, computationally resolved repeat structures, and synteny with one or more reference organisms. The CARP tool requires a set of repeat sequences, such as insertion sequences (IS), that can be found computationally without assembling the genome. Work on methods to identify such repeat regions directly from raw sequence reads or draft genomes led to the development of the ISQuest software package. ISQuest identifies bacterial ISs and their sequence elements (inverted and direct repeats) in raw read data or contigs using flexible search parameters. ISQuest is capable of finding ISs in hundreds of partially assembled genomes within hours, making it a valuable high-throughput tool for a global search of IS and repeat elements. The CARP tool matches very low-error contigs with strong overlap using the ambiguous partial repeat sequences at the contig ends, annotated with the repeat sequences discovered by ISQuest. These matches are verified by synteny with the genomes of one or more reference organisms. We show that the CARP tool can be used to verify regions with low mate-pair evidence, independently find new joins, and significantly reduce the number of scaffolds. Finally, we demonstrate a novel viewer that presents the computationally derived joins to the user along with the evidence used to make them. The viewer allows users to independently assess their confidence in the joins made by the finishing tools and to make an informed decision on whether to invest the resources necessary to confirm a particular portion of the assembly. Further, we allow users to manually record join evidence, re-order contigs, and track the assembly finishing process.

    Mandrake: visualizing microbial population structure by embedding millions of genomes into a low-dimensional representation

    In less than a decade, population genomics of microbes has progressed from the effort of sequencing dozens of strains to thousands, or even tens of thousands, of strains in a single study. There are now hundreds of thousands of genomes available even for a single bacterial species, and the number of genomes is expected to continue to increase at an accelerated pace given the advances in sequencing technology and widespread genomic surveillance initiatives. This explosion of data calls for innovative methods to enable rapid exploration of the structure of a population based on different data modalities, such as multiple sequence alignments, assemblies and estimates of gene content across different genomes. Here, we present Mandrake, an efficient implementation of a dimensionality reduction method tailored to the needs of large-scale population genomics. Mandrake is capable of visualizing population structure from millions of whole genomes, and we illustrate its usefulness with several datasets representing major pathogens. Our method is freely available both as an analysis pipeline (https://github.com/johnlees/mandrake) and as a browser-based interactive application (https://gtonkinhill.github.io/mandrake-web/). This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.

    A highly efficient multi-core algorithm for clustering extremely large datasets

    Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches to parallelizing algorithms rely largely on network communication protocols that connect, and therefore require, multiple computers. One answer to this problem is to use the intrinsic capabilities of current multi-core hardware to distribute the tasks among the different cores of one computer. Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms, based on the design principles of transactional memory, for clustering gene expression microarray-type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. The computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets, while preserving computational accuracy, compared to single-core implementations and a recently published network-based parallelization. Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on a laboratory computer.
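The general shape of a shared-memory k-means can be sketched as follows: the assignment step is split into chunks that worker threads process independently, and the centroid update runs once all chunks are done. This is only a structural illustration with hypothetical function names. It is neither the authors' Java implementation nor based on transactional memory, and because of CPython's global interpreter lock these pure-Python threads show the decomposition rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def assign_chunk(chunk, centroids):
    """Assignment step for one chunk of points; chunks are independent."""
    return [nearest(point, centroids) for point in chunk]


def kmeans(points, initial_centroids, workers=2, iterations=10):
    """Lloyd's k-means with the assignment step distributed over threads."""
    centroids = [list(c) for c in initial_centroids]
    size = max(1, (len(points) + workers - 1) // workers)
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    labels = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(iterations):
            # Parallel part: every chunk reads the same centroids.
            parts = pool.map(assign_chunk, chunks, [centroids] * len(chunks))
            labels = [label for part in parts for label in part]
            # Sequential part: recompute each centroid from its members.
            for k in range(len(centroids)):
                members = [p for p, lab in zip(points, labels) if lab == k]
                if members:
                    centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, [(0, 0), (10, 10)])
```

Only the assignment step is distributed because it dominates the cost for large data sets and needs no coordination between chunks. The update step is the place where a concurrent implementation has to synchronize.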

    FisherMP: fully parallel algorithm for detecting combinatorial motifs from large ChIP-seq datasets.

    Detecting binding motifs of combinatorial transcription factors (TFs) from chromatin immunoprecipitation sequencing (ChIP-seq) experiments is an important and challenging computational problem for understanding gene regulation. Although a number of motif-finding algorithms have been presented, most are either time-consuming or have sub-optimal accuracy when processing large-scale datasets. In this article, we present a fully parallelized algorithm for detecting combinatorial motifs from ChIP-seq datasets, using the Fisher combined method and an OpenMP parallel design. Large-scale validations on both synthetic data and 350 ChIP-seq datasets from the ENCODE database showed that FisherMP is not only very fast on large datasets but also highly accurate compared with multiple popular methods. Using FisherMP, we successfully detected combinatorial motifs of CTCF, YY1, MAZ, STAT3 and USF2 in chromosome X, suggesting that they are functional co-players in gene regulation and chromosomal organization. Integrative and statistical analyses of these TF-binding peaks demonstrate that they are not only highly coordinated with each other but also correlated with histone modifications. FisherMP can be applied to the integrative analysis of binding motifs and to predicting cis-regulatory modules from a large number of ChIP-seq datasets.
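For readers unfamiliar with the Fisher combined method named above, the following sketch shows the standard statistic: k independent p-values are combined as X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. How FisherMP applies the method to motif scores is not described in the abstract, so the function name and inputs are illustrative assumptions. The closed-form tail probability used here holds because the degrees of freedom are even.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) is chi-square with 2k degrees of
    freedom; for even degrees of freedom the upper tail has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    if not pvalues or any(not 0.0 < p <= 1.0 for p in pvalues):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


single = fisher_combined_pvalue([0.05])
pair = fisher_combined_pvalue([0.05, 0.05])
```

With a single input the method returns that p-value unchanged. With two inputs the result equals p1·p2·(1 − ln(p1·p2)), so two moderately small p-values reinforce each other.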

    Crystallographic Analysis and Molecular Modeling Studies of HIV-1 Protease and Drug Resistant Mutants

    HIV-1 protease (PR) is an effective target protein for drugs in anti-retroviral therapy (ART). Using PR inhibitors (PIs) in clinical therapy successfully reduces the mortality of HIV-infected patients. However, drug-resistant variants are selected in AIDS patients because of the fast evolution of the viral genome. Structural and kinetic studies and MD simulations of PR variants with or without substrate or PIs were used to better understand the molecular basis of drug resistance. Information obtained from these extensive studies will benefit the design of more effective inhibitors in ART. Amprenavir (APV) inhibition of PRWT and of the single mutants PRV32I, PRI50V, PRI54M, PRI54V, PRI84V and PRL90M was studied, and X-ray crystal structures of PR variant complexes with APV were solved at resolutions of 1.02-1.85 Å to identify structural alterations. Crystal structures of PRWT, PRV32I and PRI47V were solved at resolutions of 1.20-1.40 Å. Reaction intermediates were captured in the substrate binding cavity, representing three consecutive steps in the catalytic reaction of HIV PR. The HIV-1 PR20 variant is a multi-drug-resistant variant from a clinical isolate and is useful for investigating the mechanisms of resistance. The crystal structures of PR20 with the inactivating mutation D25N have been determined at 1.45-1.75 Å resolution, and three distinct flap conformations (open, twisted and tucked) were observed. These studies help explain the molecular basis of drug resistance and provide clues for the design of inhibitors to combat multi-drug-resistant PR. The evaluation of the electrostatic force is the computationally intensive part of MD simulations; it is of order Θ(N^2) when all atom pairs are summed. AMMP uses the Amortized FMM to sum the electrostatic force, which reduces the workload to Θ(N). A hybrid CPU and GPU parallel implementation of the Amortized FMM was developed and makes MD simulation 20-fold faster than CPU-based parallelization.
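The quadratic cost mentioned at the end of the abstract comes from visiting every pair of atoms. A minimal direct-summation sketch makes the N(N-1)/2 pair loop explicit. It uses Coulomb's law with an arbitrary constant k = 1 and made-up coordinates. It is not AMMP code, and the fast multipole approximation that removes the quadratic cost is not shown.

```python
import math


def coulomb_forces(positions, charges, k=1.0):
    """Direct pairwise summation of electrostatic forces.

    Visits each of the N*(N-1)/2 atom pairs once, so the cost grows
    quadratically with the number of atoms. Units are arbitrary (k = 1).
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            delta = [a - b for a, b in zip(positions[i], positions[j])]
            r = math.sqrt(sum(d * d for d in delta))
            scale = k * charges[i] * charges[j] / r ** 3
            for axis in range(3):
                # Newton's third law: equal and opposite contributions.
                forces[i][axis] += scale * delta[axis]
                forces[j][axis] -= scale * delta[axis]
    return forces


# Two like unit charges two length units apart repel along the x axis.
forces = coulomb_forces([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [1.0, 1.0])
```

Doubling the number of atoms roughly quadruples the work in this loop, which is why a method with linear cost makes a large difference for big systems.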

    Root Digger: a root placement program for phylogenetic trees

    Background: In phylogenetic analysis, it is common to infer unrooted trees. However, knowing the root location is desirable for downstream analyses and interpretation. There exist several methods to recover a root, such as molecular clock analysis (including midpoint rooting) or rooting the tree using an outgroup. Non-reversible Markov models can also be used to compute the likelihood of a potential root position. Results: We present RootDigger, a software tool that uses a non-reversible Markov model to compute the most likely root location on a given tree and to infer a confidence value for each possible root placement. We find that RootDigger is successful at finding roots when compared to similar tools such as IQ-TREE and MAD, and will occasionally outperform them. Additionally, we find that the exhaustive mode of RootDigger is useful in quantifying and explaining uncertainty in rooting positions. Conclusions: RootDigger can be used on an existing phylogeny to find a root, or to assess the uncertainty of the root placement.

    High-Performance approaches for Phylogenetic Placement, and its application to species and diversity quantification

    In recent years, advances in high-throughput gene sequencing, combined with the continued exponential growth and availability of computing resources, have led to fundamentally new analytical approaches in biology. It is now possible to comprehensively sequence the genetic content of entire communities of organisms from individual environmental samples. Such methods are particularly relevant for microbiology, which was previously largely restricted to studying those microbes that could be cultured in the laboratory (i.e., in vitro), covering only a small part of the diversity found in nature. In contrast, high-throughput sequencing now enables the direct capture of the genetic sequences of a microbiome as it occurs in its natural environment (i.e., in situ). A typical goal of microbiome studies is the taxonomic classification of the sequences contained in a sample (query sequences). Phylogenetic methods are commonly used to determine detailed taxonomic relationships between query sequences and trusted reference sequences from already classified organisms. Because of the high volume (10^6 to 10^9) of query sequences that can be generated from a microbiome sample by high-throughput sequencing, accurate phylogenetic tree reconstruction is no longer computationally feasible. Moreover, commonly used sequencing technologies currently produce comparatively short sequences with limited phylogenetic signal, which leads to instability when inferring phylogenies from these sequences. Another typical goal of microbiome studies is to quantify the diversity within a sample or between several samples. Phylogenetic methods are commonly used for this as well. These methods often require the inference of a phylogenetic tree that comprises either all sequences or a clustered subset of them. As with taxonomic identification, analyses based on this kind of tree inference can yield inaccurate results and/or be computationally infeasible. In contrast to comprehensive phylogenetic inference, phylogenetic placement is a method that determines the phylogenetic context of a query sequence within an established reference tree. This procedure typically treats the reference tree as immutable, i.e., the reference tree is not changed before, during, or after the placement of a sequence. This allows the phylogenetic placement of a sequence to be performed in linear time with respect to the size of the reference tree. Combined with taxonomic information about the reference sequences, phylogenetic placement thus enables the taxonomic identification of a sequence. Furthermore, phylogenetic placement permits a variety of additional analyses, for example associating the composition of human microbiomes with clinical-diagnostic properties. In this dissertation, I present my work on the design, implementation, and improvement of EPA-ng, a high-performance implementation of phylogenetic placement under the maximum-likelihood model. EPA-ng was designed to scale to billions of query sequences and to run on thousands of cores in shared- and distributed-memory systems. EPA-ng also accelerates single-core processing by up to 30-fold compared to its direct competitors. Recently, we introduced an additional method for EPA-ng that enables placement on substantially larger reference trees. For this we use an active memory management approach that trades reduced memory consumption for longer execution times. In addition, I present a massively parallel approach for quantifying the diversity of a sample, based on the results of phylogenetic placements. This software, called SCRAPP, combines current methods for maximum-likelihood-based phylogenetic inference with methods for molecular species delimitation. The result is a distribution of species counts over the edges of a reference tree for a given sample. Furthermore, I describe a novel approach to clustering placement results, which lets the user reduce the computational cost.

    Efficient estimation of evolutionary distances

    The advent of high-throughput sequencers has led to a dramatic increase in the size of available genomic data. Standard methods, which have worked well for many years, are not suitable for the analysis of big data sets because they rely on a time-consuming alignment step. In this thesis, a new alignment-free approach for phylogeny reconstruction is introduced. The corresponding program, andi, is orders of magnitude faster than classical approaches and also superior to comparable alignment-free methods. The central data structure in andi is the enhanced suffix array. It is used to find long exact matches between sequences. In this thesis, various approaches to the construction of enhanced suffix arrays, including novel ones, are evaluated with respect to performance. Additionally, a new parallel algorithm for the computation of suffix arrays is introduced.
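To make the role of the suffix array concrete, the sketch below builds one naively and uses it to find the longest exact match shared by two sequences. This is a teaching-sized illustration with hypothetical function names: the construction simply sorts suffixes, which is far slower than the specialized algorithms the thesis evaluates, and it is not andi's implementation. The separator character is assumed to occur in neither sequence.

```python
def suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_length(text, i, j):
    """Length of the longest common prefix of the suffixes starting at i and j."""
    n = 0
    while i + n < len(text) and j + n < len(text) and text[i + n] == text[j + n]:
        n += 1
    return n


def longest_exact_match(a, b, separator="#"):
    """Longest substring shared by a and b, via a suffix array of a + sep + b.

    The longest shared substring is the longest common prefix of two
    neighbouring suffixes in sorted order that start in different sequences.
    """
    text = a + separator + b
    order = suffix_array(text)
    best = ""
    for left, right in zip(order, order[1:]):
        if (left < len(a)) == (right < len(a)):
            continue  # both suffixes start in the same sequence
        length = common_prefix_length(text, left, right)
        if length > len(best):
            best = text[left:left + length]
    return best


match = longest_exact_match("ACGTTGCA", "TTGCACG")
```

An enhanced suffix array stores additional tables, such as the longest-common-prefix values recomputed here on the fly, so that matches can be located without rescanning the text.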