183 research outputs found

Efficient Storage of Genomic Sequences in High Performance Computing Systems

Author: Guerra Soler Aníbal José
Publication venue: Medellín, Colombia
Publication date: 01/01/2019

ABSTRACT: In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequence data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors offer only limited performance on genomic data, hence the need for specialized compression solutions. Two trends have emerged as alternatives to harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still fall below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for genomic raw data compression, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims at improving the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios when compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use.
In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of the data-level parallelism in the Intel AVX-512 architecture, in order to develop a parallel version of the UdeACompress algorithms to reduce the runtime. Through the use of SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress, the Suffix Array Construction
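To illustrate the idea of referential compression described in this abstract, here is a minimal Python sketch. It is not UdeACompress: the function names `encode_read` and `decode_read` are hypothetical, the placement is a brute-force ungapped scan rather than a real local alignment, and no binary encoding is applied. It only shows that a read can be stored as a position in the reference plus a list of differing bases.

```python
def encode_read(reference: str, read: str):
    """Return (position, mismatches) for the ungapped placement with fewest mismatches.

    Hypothetical helper for illustration; real tools use an index and gapped alignment.
    """
    best_pos, best_mm = 0, None
    for pos in range(len(reference) - len(read) + 1):
        # Record (offset, base) for every base of the read that differs from the reference.
        mm = [(i, base) for i, base in enumerate(read) if reference[pos + i] != base]
        if best_mm is None or len(mm) < len(best_mm):
            best_pos, best_mm = pos, mm
    if best_mm is None:
        raise ValueError("read is longer than the reference")
    return best_pos, best_mm


def decode_read(reference: str, pos: int, length: int, mismatches):
    """Rebuild the read from the reference window and the stored mismatches."""
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGC"
read = "ACGTTTGC"
pos, mismatches = encode_read(reference, read)
restored = decode_read(reference, pos, len(read), mismatches)
```

The closer the read is to the reference, the shorter the mismatch list, which is why the choice of reference matters so much for this family of compressors.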

Biblioteca Digital del Sistema de Bibliotecas de la Universidad de Antioquia

BRIDES: A New Fast Algorithm and Software for Characterizing Evolving Similarity Networks Using Breakthroughs, Roadblocks, Impasses, Detours, Equals and Shortcuts

Author: Bapteste Éric
Lapointe François-Joseph
Le Cam Margaux
Lord Etienne
Makarenkov Vladimir
Méheust Raphaël
Publication venue: Public Library of Science
Publication date: 01/01/2016

Various types of genome and gene similarity networks, along with their characteristics, have been increasingly used for retracing different kinds of evolutionary and ecological relationships. Here, we present a new polynomial-time algorithm and the corresponding software (BRIDES) to characterize the different types of paths existing in evolving (or augmented) similarity networks, under the constraint that such paths contain at least one node that was not present in the original network. These different paths are denoted as Breakthroughs, Roadblocks, Impasses, Detours, Equal paths, and Shortcuts. Analyzing their distribution can help discriminate among different evolutionary hypotheses concerning the genomes or genes at hand. Our approach is based on an original application of the popular shortest-path algorithms of Dijkstra and Yen.
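The shortest-path comparison that this abstract builds on can be sketched in Python as follows. This is only a rough illustration under stated assumptions: `dijkstra` is a textbook implementation, `compare_pair` is a hypothetical helper, and the comparison ignores the constraint that paths must pass through an added node, so its labels are generic descriptions and not the BRIDES path categories themselves.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source; graph maps node -> {neighbor: non-negative weight}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def compare_pair(original, augmented, s, t):
    """Describe how the s-t distance changes between the two networks (hypothetical labels)."""
    inf = float("inf")
    d0 = dijkstra(original, s).get(t, inf)
    d1 = dijkstra(augmented, s).get(t, inf)
    if d0 == inf and d1 == inf:
        return "unreachable in both"
    if d0 == inf:
        return "newly reachable"
    if d1 < d0:
        return "shorter"
    if d1 == d0:
        return "equal"
    return "longer"


original = {"a": {"b": 1}, "b": {"a": 1, "c": 5}, "c": {"b": 5}, "d": {}}
# Augmented network: same edges plus a new node "x" linked to b, c and d.
augmented = {
    "a": {"b": 1},
    "b": {"a": 1, "c": 5, "x": 1},
    "c": {"b": 5, "x": 1},
    "d": {"x": 1},
    "x": {"b": 1, "c": 1, "d": 1},
}
```

Here `compare_pair(original, augmented, "a", "c")` reports a shorter path through the added node `x`, while the pair `("a", "d")` only becomes connected in the augmented network.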

HAL-UNICE

Directory of Open Access Journals

PubMed Central

FigShare

Efficient Algorithms for Prokaryotic Whole Genome Assembly and Finishing

Author: Biswas Abhishek
Publication venue: ODU Digital Commons
Publication date: 01/10/2015

De-novo genome assembly from DNA fragments is primarily based on sequence overlap information. In addition, mate-pair reads or paired-end reads provide linking information for joining gaps and bridging repeat regions. Genome assemblers in general assemble long contiguous sequences (contigs) using both overlapping reads and linked reads until the assembly runs into an ambiguous repeat region. These contigs are further bridged into scaffolds using linked read information. However, errors can be made in both phases of assembly due to a high error threshold for overlap acceptance and linking based on too few mate reads. Identical as well as similar repeat regions can often cause errors in overlap and mate-pair evidence. In addition, setting the correct threshold to minimize errors and optimize the assembly of reads is not trivial and often requires a time-consuming trial-and-error process to obtain optimal results. The typical trial-and-error process with multiple assemblers can be computationally intensive and is very inefficient, especially when users must learn how to use a wide variety of assemblers, many of which may be serial, require long execution times, and may not return usable or accurate results. Further, we show that comparing assembly results may not provide users with a clear winner under all circumstances. Therefore, we propose a novel scaffolding tool, Correlative Algorithm for Repeat Placement (CARP), capable of joining short low-error contigs using mate-pair reads, computationally resolved repeat structures, and synteny with one or more reference organisms. The CARP tool requires a set of repeat sequences, such as insertion sequences (IS), that can be found computationally without assembling the genome. Work on methods to identify such repeating regions directly from raw sequence reads or draft genomes led to the development of the ISQuest software package.
ISQuest identifies bacterial ISs and their sequence elements (inverted and direct repeats) in raw read data or contigs using flexible search parameters. ISQuest is capable of finding ISs in hundreds of partially assembled genomes within hours, making it a valuable high-throughput tool for a global search of IS and repeat elements. The CARP tool matches very low-error contigs with strong overlap using the ambiguous partial repeat sequence at the ends of each contig, annotated with the repeat sequences discovered by ISQuest. These matches are verified by synteny with the genomes of one or more reference organisms. We show that the CARP tool can be used to verify regions with low mate-pair evidence, independently find new joins, and significantly reduce the number of scaffolds. Finally, we demonstrate a novel viewer that presents to the user the computationally derived joins along with the evidence used to make them. The viewer allows users to independently assess their confidence in the joins made by the finishing tools and make an informed decision on whether to invest the resources necessary to confirm a particular portion of the assembly. Further, we allow users to manually record join evidence, re-order contigs, and track the assembly finishing process.

Old Dominion University

Mandrake: visualizing microbial population structure by embedding millions of genomes into a low-dimensional representation

Author: Corander Jukka
Lees John A.
Tonkin-Hill Gerry
Yang Zhirong
Publication date: 22/08/2022

In less than a decade, population genomics of microbes has progressed from sequencing dozens of strains to thousands, or even tens of thousands, of strains in a single study. There are now hundreds of thousands of genomes available even for a single bacterial species, and the number of genomes is expected to continue to increase at an accelerated pace given the advances in sequencing technology and widespread genomic surveillance initiatives. This explosion of data calls for innovative methods to enable rapid exploration of the structure of a population based on different data modalities, such as multiple sequence alignments, assemblies and estimates of gene content across different genomes. Here, we present Mandrake, an efficient implementation of a dimensional reduction method tailored for the needs of large-scale population genomics. Mandrake is capable of visualizing population structure from millions of whole genomes, and we illustrate its usefulness with several datasets representing major pathogens. Our method is freely available both as an analysis pipeline (https://github.com/johnlees/mandrake) and as a browser-based interactive application (https://gtonkinhill.github.io/mandrake-web/). This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.

PubMed Central

Aaltodoc Publication Archive

Helsingin yliopiston digitaalinen arkisto

A highly efficient multi-core algorithm for clustering extremely large datasets

Author: A Ben-Hur
A Bertoni
A Jain
AK Jain
AR Adl-Tabatabai
AWF Edwards
B Andreopoulos
B Chapman
C Herzeel
Consortium IH
D Lea
D Smirnov
DR Barr
E Levine
F Müller
G Dalgin
HA Kestler
Hans A Kestler
HW Kuhn
J Fridlyand
J Handl
J Larus
J MacQueen
Johann M Kraus
JW Sammon
K Fukunaga
L Hubert
L Kuncheva
M Anderson
M Ng
MK Kerr
N Shavit
P Jaccard
P Sham
PA Bernstein
R Development Core Team
R Duan
R Graham
R Jonker
R Rajwar
R Tibshirani
R Xu
RC Gentleman
S Monti
S Peyton-Jones
S Selim
T Kohonen
T Lange
U Drepper
W Feng
W Gropp
W Rand
WJ Conover
X Gao
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms rely largely on network communication protocols that connect, and therefore require, multiple computers. One answer to this problem is to use the intrinsic capabilities of current multi-core hardware to distribute the tasks among the different cores of one computer. Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms, based on the design principles of transactional memory, for clustering gene expression microarray-type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. The computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network-based parallelization. Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.
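As a rough illustration of why k-means lends itself to shared-memory parallelization, the Python sketch below splits the assignment step into chunks that are handled by separate workers. It is not the authors' Java implementation and does not use transactional memory; the names `nearest`, `assign_chunk` and `kmeans` are hypothetical, and Python threads are used only to show the structure (they will not give a real speed-up for this pure-Python arithmetic).

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centers):
    """Index of the center with the smallest squared Euclidean distance to point."""
    return min(range(len(centers)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])))


def assign_chunk(args):
    """Assignment step for one chunk of points; chunks are independent of each other."""
    chunk, centers = args
    return [nearest(p, centers) for p in chunk]


def kmeans(points, centers, iterations=10, workers=2):
    centers = [list(c) for c in centers]
    labels = []
    size = max(1, -(-len(points) // workers))  # ceiling division
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(iterations):
            # Parallel part: each worker labels its own chunk against the shared centers.
            parts = pool.map(assign_chunk, [(c, centers) for c in chunks])
            labels = [lab for part in parts for lab in part]
            # Serial part: recompute each center as the mean of its members.
            for k in range(len(centers)):
                members = [p for p, lab in zip(points, labels) if lab == k]
                if members:
                    centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centers = kmeans(points, centers=[(0, 0), (10, 10)])
```

Only the center update needs coordinated access to shared state, which is the part a transactional-memory design would have to protect.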

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

FisherMP: fully parallel algorithm for detecting combinatorial motifs from large ChIP-seq datasets.

Author: Chen Yong
Liang Ying
Su Zhengchang
Wang Xiangyun
Zhang Shaoqiang
Publication venue: Rowan Digital Works
Publication date: 01/06/2019

Detecting binding motifs of combinatorial transcription factors (TFs) from chromatin immunoprecipitation sequencing (ChIP-seq) experiments is an important and challenging computational problem for understanding gene regulation. Although a number of motif-finding algorithms have been presented, most are either time-consuming or have sub-optimal accuracy when processing large-scale datasets. In this article, we present a fully parallelized algorithm for detecting combinatorial motifs from ChIP-seq datasets using Fisher's combined method and an OpenMP parallel design. Large-scale validations on both synthetic data and 350 ChIP-seq datasets from the ENCODE database showed that FisherMP is not only very fast on large datasets but also highly accurate when compared with multiple popular methods. Using FisherMP, we successfully detected combinatorial motifs of CTCF, YY1, MAZ, STAT3 and USF2 in chromosome X, suggesting that they are functional co-players in gene regulation and chromosomal organization. Integrative and statistical analysis of these TF-binding peaks clearly demonstrates that they are not only highly coordinated with each other but also correlated with histone modifications. FisherMP can be applied to the integrative analysis of binding motifs and to predicting cis-regulatory modules from a large number of ChIP-seq datasets.
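Fisher's combined method, mentioned in this abstract, merges k independent p-values into the statistic X = -2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The Python sketch below shows that calculation only; how FisherMP applies it to motifs is not described here, and the function name `fisher_combined` is hypothetical.

```python
import math


def fisher_combined(p_values):
    """Return (statistic, combined p-value) for independent p-values in (0, 1]."""
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    # Chi-square survival function with an even number (2k) of degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, math.exp(-half) * total


stat, combined = fisher_combined([0.05, 0.05])
```

Two moderately small p-values combine into a smaller one, while a single p-value is returned unchanged, which is a quick sanity check on the formula.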

Rowan University

Crystallographic Analysis and Molecular Modeling Studies of HIV-1 Protease and Drug Resistant Mutants

Author: Shen Chen-Hsiang
Publication venue: ScholarWorks @ Georgia State University
Publication date: 18/12/2014

HIV-1 protease (PR) is an effective target protein for drugs in anti-retroviral therapy (ART). Using PR inhibitors (PIs) in clinical therapy successfully reduces the mortality of HIV-infected patients. However, drug-resistant variants are selected in AIDS patients because of the fast evolution of the viral genome. Structural, kinetic and MD simulation studies of PR variants with or without substrate or PIs were used to better understand the molecular basis of drug resistance. Information obtained from these extensive studies will benefit the design of more effective inhibitors in ART. Amprenavir (APV) inhibition of PRWT and the single mutants PRV32I, PRI50V, PRI54M, PRI54V, PRI84V and PRL90M was studied, and X-ray crystal structures of PR variant complexes with APV were solved at resolutions of 1.02-1.85 Å to identify structural alterations. Crystal structures of PRWT, PRV32I and PRI47V were solved at resolutions of 1.20-1.40 Å. Reaction intermediates were captured in the substrate binding cavity, representing three consecutive steps in the catalytic reaction of HIV PR. The HIV-1 PR20 variant is a multi-drug-resistant variant from a clinical isolate and is useful for investigating the mechanisms of resistance. The crystal structures of PR20 with the inactivating mutation D25N have been determined at 1.45-1.75 Å resolution, and three distinct flap conformations (open, twisted and tucked) were observed. These studies help in understanding the molecular basis of drug resistance and provide clues for the design of inhibitors to combat multi-drug-resistant PR. The evaluation of electrostatic forces is the computationally intensive part of MD simulations; it is of order Θ(N²) when all atom pairs are summed. AMMP uses Amortized FMM to sum the electrostatic forces, which reduces the workload to Θ(N). A hybrid CPU and GPU parallel implementation of Amortized FMM was developed and makes MD simulation 20-fold faster than CPU-based parallelization.
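The quadratic cost mentioned at the end of this abstract comes from visiting every pair of atoms. The Python sketch below shows such a direct pairwise sum for the Coulomb energy, used here as a simpler stand-in for the force. It is not AMMP code; `coulomb_energy` is a hypothetical name, and reduced units with the Coulomb constant set to 1 are an assumption.

```python
import math
from itertools import combinations


def coulomb_energy(charges, positions):
    """Direct pairwise electrostatic energy; also returns the number of pairs visited."""
    energy = 0.0
    pairs = 0
    for i, j in combinations(range(len(charges)), 2):
        r = math.dist(positions[i], positions[j])  # Euclidean distance (Python 3.8+)
        energy += charges[i] * charges[j] / r
        pairs += 1
    return energy, pairs


charges = [1.0, -1.0, 1.0]
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
energy, pairs = coulomb_energy(charges, positions)
```

For N atoms the loop visits N(N-1)/2 pairs, so doubling the system size roughly quadruples the work; multipole methods avoid this by approximating the contribution of distant groups of atoms.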

ScholarWorks @ Georgia State University

Root Digger: a root placement program for phylogenetic trees

Author: Bettisworth Ben
Stamatakis Alexandros
Publication venue: BioMed Central
Publication date: 20/05/2021

Background: In phylogenetic analysis, it is common to infer unrooted trees. However, knowing the root location is desirable for downstream analyses and interpretation. There exist several methods to recover a root, such as molecular clock analysis (including midpoint rooting) or rooting the tree using an outgroup. Non-reversible Markov models can also be used to compute the likelihood of a potential root position. Results: We present software called RootDigger, which uses a non-reversible Markov model to compute the most likely root location on a given tree and to infer a confidence value for each possible root placement. We find that RootDigger is successful at finding roots when compared to similar tools such as IQ-TREE and MAD, and will occasionally outperform them. Additionally, we find that the exhaustive mode of RootDigger is useful in quantifying and explaining uncertainty in rooting positions. Conclusions: RootDigger can be used on an existing phylogeny to find a root, or to assess the uncertainty of the root placement.

KITopen

High-Performance approaches for Phylogenetic Placement, and its application to species and diversity quantification

Author: Barbera Pierre
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 08/11/2021

In recent years, advances in high-throughput gene sequencing, combined with the continued exponential growth and availability of computing resources, have led to fundamentally new analytical approaches in biology. It is now possible to comprehensively sequence the genetic content of entire communities of organisms from individual environmental samples. Such methods are particularly relevant to microbiology. Microbiology was previously largely limited to the study of those microbes that could be cultured in the laboratory (i.e., in vitro), which covers only a small part of the diversity found in nature. In contrast, high-throughput sequencing now allows the genetic sequences of a microbiome to be captured directly as it occurs in its natural environment (i.e., in situ). A typical goal of microbiome studies is the taxonomic classification of the sequences contained in a sample (query sequences). Phylogenetic methods are commonly used to determine detailed taxonomic relationships between query sequences and trusted reference sequences that come from already classified organisms. Because of the high volume (10^6 to 10^9) of query sequences that can be generated from a microbiome sample by high-throughput sequencing, accurate phylogenetic tree reconstruction is no longer computationally feasible. Moreover, the sequencing technologies in common use today produce comparatively short sequences with limited phylogenetic signal, which leads to instability when inferring phylogenies from these sequences. Another typical goal of microbiome studies is to quantify the diversity within a sample or between several samples. Phylogenetic methods are commonly used for this as well. These methods often require the inference of a phylogenetic tree that comprises either all sequences or a clustered subset of them. As with taxonomic identification, analyses based on this kind of tree inference can yield inaccurate results and/or be computationally infeasible. In contrast to comprehensive phylogenetic inference, phylogenetic placement is a method that determines the phylogenetic context of a query sequence within an established reference tree. This procedure typically treats the reference tree as immutable, i.e., the reference tree is not changed before, during, or after the placement of a sequence. This allows the phylogenetic placement of a sequence to be carried out in linear time with respect to the size of the reference tree. Combined with taxonomic information about the reference sequences, phylogenetic placement thus enables the taxonomic identification of a sequence. In addition, phylogenetic placement permits a wide range of further analyses, such as associating the composition of human microbiomes with clinical diagnostic characteristics.
In this dissertation, I present my work on the design, implementation, and improvement of EPA-ng, a high-performance implementation of phylogenetic placement under the maximum likelihood model. EPA-ng was designed to scale to billions of query sequences and to run on thousands of cores in shared-memory and distributed-memory systems. EPA-ng also increases processing speed on a single core by up to 30-fold compared with its direct competitors. Recently, we introduced an additional method for EPA-ng that enables placement in substantially larger reference trees. For this we use an active memory management approach that trades reduced memory consumption for longer execution times. In addition, I present a massively parallel approach for quantifying the diversity of a sample, based on the results of phylogenetic placement. This software, called SCRAPP, combines current methods for maximum-likelihood-based phylogenetic inference with methods for molecular species delimitation. The result is a distribution of species counts over the edges of a reference tree for a given sample. Furthermore, I describe a novel approach to clustering placement results, which allows the user to reduce the computational effort.

KITopen

Efficient estimation of evolutionary distances

Author: Klötzl F.
Publication venue: Universität zu Lübeck, Institut für Neuro- und Bioinformatik
Publication date: 01/04/2015

The advent of high-throughput sequencers has led to a dramatic increase in the size of available genomic data. Standard methods, which have worked well for many years, are not suitable for the analysis of big data sets because they rely on a time-consuming alignment step. In this thesis, a new alignment-free approach for phylogeny reconstruction is introduced. The corresponding program, andi, is orders of magnitude faster than classical approaches and also superior to comparable alignment-free methods. The central data structure in andi is the enhanced suffix array. It is used to find long exact matches between sequences. In this thesis, various approaches to the construction of enhanced suffix arrays, including novel ones, are evaluated with respect to performance. Additionally, a new parallel algorithm for the computation of suffix arrays is introduced.
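To make the role of the suffix array in this abstract concrete, the Python sketch below builds a plain suffix array by naive sorting and uses binary search to find the longest prefix of a query that occurs in the text. It is not how andi works internally: andi uses an enhanced suffix array with efficient construction, whereas this construction is quadratic or worse, and the names `suffix_array`, `common_prefix_len` and `longest_match` are hypothetical.

```python
def suffix_array(text):
    """Start positions of all suffixes of text in lexicographic order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_match(text, sa, query):
    """Return (position, length) of the longest prefix of query found in text."""
    # Binary search for the point where query would be inserted among the sorted suffixes.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    # The best match is one of the two suffixes adjacent to the insertion point.
    best_pos, best_len = -1, 0
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            n = common_prefix_len(text[sa[idx]:], query)
            if n > best_len:
                best_pos, best_len = sa[idx], n
    return best_pos, best_len


text = "banana"
sa = suffix_array(text)
```

For example, `longest_match(text, sa, "anan")` finds the four-character match starting at position 1, and a query with no shared first character returns `(-1, 0)`.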

MPG.PuRe