1,346 research outputs found

    A Graph Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction

    Reconstructing components of a genomic mixture from data obtained by means of DNA sequencing is a challenging problem encountered in a variety of applications including single individual haplotyping and studies of viral communities. High-throughput DNA sequencing platforms oversample mixture components to provide massive amounts of reads whose relative positions can be determined by mapping the reads to a known reference genome; assembly of the components, however, requires discovery of the reads' origin -- an NP-hard problem that the existing methods struggle to solve with the required level of accuracy. In this paper, we present a learning framework based on a graph auto-encoder designed to exploit structural properties of sequencing data. The algorithm is a neural network that essentially learns to ignore sequencing errors and infers the posterior probabilities of the origin of sequencing reads. Mixture components are then reconstructed by finding the consensus of the reads determined to originate from the same genomic component. Results on realistic synthetic as well as experimental data demonstrate that the proposed framework reliably assembles haplotypes and reconstructs viral communities, often significantly outperforming state-of-the-art techniques.
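    The final consensus step described above can be sketched in a few lines; this is a minimal illustration, not the paper's implementation, and the dict-of-positions read representation is a hypothetical toy encoding:

```python
from collections import Counter

def consensus(reads):
    """Majority vote at each position over reads assigned to one
    component; reads are dicts {position: base}, so they may cover
    different parts of the genome."""
    by_pos = {}
    for read in reads:
        for pos, base in read.items():
            by_pos.setdefault(pos, []).append(base)
    return {pos: Counter(bases).most_common(1)[0][0]
            for pos, bases in by_pos.items()}

# Three overlapping reads; the "G" at position 2 is a sequencing error
# that the majority vote removes.
reads = [{0: "A", 1: "C", 2: "G"}, {1: "C", 2: "T"}, {2: "T", 3: "A"}]
print(consensus(reads))  # {0: 'A', 1: 'C', 2: 'T', 3: 'A'}
```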

    Nephele: genotyping via complete composition vectors and MapReduce

    Background: Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism, mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for the rapidly growing sequence datasets being generated by next-generation DNA sequencing machines, because their computational complexity increases quickly with the number of sequences.
    Results: Nephele is a suite of tools that uses the complete composition vector algorithm to represent each sequence in the dataset as a vector derived from its constituent k-mers, bypassing the need for multiple sequence alignment, and affinity propagation clustering to group the sequences into genotypes based on a distance measure over the vectors. Our methods produce results that correlate well with expert-defined clades or genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware. Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 hours.
    Conclusions: We conclude that using Nephele can substantially decrease the processing time required for generating genotype trees of tens to hundreds of organisms at genome-scale sequence coverage.
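    The alignment-free idea can be sketched as follows. This simplified version uses raw k-mer frequency vectors and cosine distance; the full complete composition vector method additionally subtracts a Markov background estimated from shorter k-mers, and the clustering step (affinity propagation) is omitted. The sequences are illustrative:

```python
from collections import Counter
from math import sqrt

def kmer_vector(seq, k=3):
    """Raw k-mer frequency vector of a sequence (no background
    subtraction, unlike the full complete composition vector)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def cosine_distance(u, v):
    """1 - cosine similarity over sparse k-mer vectors."""
    dot = sum(u.get(x, 0.0) * v.get(x, 0.0) for x in set(u) | set(v))
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / (nu * nv)

a = kmer_vector("ACGTACGTACGT")
b = kmer_vector("ACGTACGAACGT")  # one substitution away from a
c = kmer_vector("TTTTTTCCCCCC")  # unrelated composition
print(round(cosine_distance(a, b), 3), round(cosine_distance(a, c), 3))
```

A clustering algorithm can then be run directly on the pairwise distance matrix, with no multiple sequence alignment required.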

    Simultaneous identification of specifically interacting paralogs and inter-protein contacts by Direct-Coupling Analysis

    Understanding protein-protein interactions is central to our understanding of almost all complex biological processes. Computational tools exploiting rapidly growing genomic databases to characterize protein-protein interactions are urgently needed. Such methods should connect multiple scales: from evolutionarily conserved interactions between families of homologous proteins, over the identification of specifically interacting proteins in the case of multiple paralogs inside a species, down to the prediction of residues being in physical contact across interaction interfaces. Statistical inference methods detecting residue-residue coevolution have recently triggered considerable progress in using sequence data for quaternary protein structure prediction; they require, however, large joint alignments of homologous protein pairs known to interact. The generation of such alignments is a complex computational task on its own; application of coevolutionary modeling has in turn been restricted to proteins without paralogs, or to bacterial systems with the corresponding coding genes being co-localized in operons. Here we show that the Direct-Coupling Analysis of residue coevolution can be extended to connect the different scales, and simultaneously to match interacting paralogs, to identify inter-protein residue-residue contacts and to discriminate interacting from noninteracting families in a multiprotein system. Our results extend the potential applications of coevolutionary analysis far beyond cases treatable so far.
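    The paralog-matching step can be sketched as an assignment problem: given inter-family coupling scores, pick the pairing that maximises the total coevolutionary signal. The scores below are hypothetical numbers, not DCA output, and the brute-force search stands in for the assignment solvers a real implementation would use:

```python
from itertools import permutations

# Toy inter-family coupling scores: score[i][j] is the coevolutionary
# signal between paralog i of family A and paralog j of family B
# (hypothetical values; in DCA these come from inferred couplings).
score = [[0.9, 0.1, 0.2],
         [0.2, 0.8, 0.1],
         [0.1, 0.3, 0.7]]

def best_matching(score):
    """Brute-force the paralog pairing that maximises total coupling;
    practical implementations use e.g. the Hungarian algorithm."""
    n = len(score)
    return max(permutations(range(n)),
               key=lambda perm: sum(score[i][perm[i]] for i in range(n)))

print(best_matching(score))  # (0, 1, 2): paralogs pair along the diagonal
```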

    Inverse Ising inference with correlated samples

    Correlations between two variables of a high-dimensional system can be indicative of an underlying interaction, but can also result from indirect effects. Inverse Ising inference is a method to distinguish one from the other. Essentially, the parameters of the least constrained statistical model are learned from the observed correlations such that direct interactions can be separated from indirect correlations. Among many other applications, this approach has been helpful for protein structure prediction, because residues which interact in the 3D structure often show correlated substitutions in a multiple sequence alignment. In this context, samples used for inference are not independent but share an evolutionary history on a phylogenetic tree. Here, we discuss the effects of correlations between samples on global inference. Such correlations could arise due to phylogeny but also via other slow dynamical processes. We present a simple analytical model to address the resulting inference biases, and develop an exact method accounting for background correlations in alignment data by combining phylogenetic modeling with an adaptive cluster expansion algorithm. We find that popular reweighting schemes are only marginally effective at removing phylogenetic bias, suggest a rescaling strategy that yields better results, and provide evidence that our conclusions carry over to the frequently used mean-field approach to the inverse Ising problem. (18 pages, 6 figures; accepted at New J. Phys.)
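    The mean-field approach mentioned at the end can be sketched on a toy chain of three coupled spins: the couplings are approximated by the off-diagonal entries of minus the inverse correlation matrix, which separates the two direct interactions from the indirect 0-2 correlation. This is a minimal illustration of naive mean-field inversion, not the paper's exact method, and the sampling scheme is an assumption for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy +/-1 samples from a chain: spin 1 follows spin 0, spin 2 follows
# spin 1, so spins 0 and 2 correlate only indirectly.
n = 20000
s0 = rng.choice([-1, 1], size=n)
s1 = np.where(rng.random(n) < 0.9, s0, -s0)
s2 = np.where(rng.random(n) < 0.9, s1, -s1)
S = np.stack([s0, s1, s2], axis=1).astype(float)

# Naive mean-field inverse Ising: J approx -(C^-1) off-diagonal,
# where C is the connected correlation matrix of the samples.
C = np.cov(S, rowvar=False)
J = -np.linalg.inv(C)
np.fill_diagonal(J, 0.0)

# Direct couplings (0,1) and (1,2) dominate the indirect pair (0,2).
print(abs(J[0, 1]) > abs(J[0, 2]), abs(J[1, 2]) > abs(J[0, 2]))
```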

    Novel Algorithms and Methodology to Help Unravel Secrets that Next Generation Sequencing Data Can Tell

    The genome of an organism is its complete set of DNA nucleotides, spanning all of its genes and also of its non-coding regions. It contains most of the information necessary to build and maintain an organism. It is therefore no surprise that sequencing the genome provides an invaluable tool for the scientific study of an organism. Via the inference of an evolutionary (phylogenetic) tree, DNA sequences can be used to reconstruct the evolutionary history of a set of species. DNA sequences, or genotype data, have also proven useful for predicting an organism's phenotype (i.e. observed traits) from its genotype. This is the objective of association studies. While methods for finding the DNA sequence of an organism have existed for decades, the recent advent of Next Generation Sequencing (NGS) has meant that the availability of such data has increased to such an extent that the computational challenges that now form an integral part of biological studies can no longer be ignored. By focusing on phylogenetics and Genome-Wide Association Studies (GWAS), this thesis aims to help address some of these challenges. As a consequence, this thesis is in two parts, with the first one centring on phylogenetics and the second one on GWAS. In the first part, we present theoretical insights for reconstructing phylogenetic trees from incomplete distances. This problem is important in the context of NGS data as incomplete pairwise distances between organisms occur frequently with such input, and ignoring taxa for which information is missing can introduce undesirable bias. In the second part we focus on the problem of inferring population stratification between individuals in a dataset due to reproductive isolation. While powerful methods for doing this have been proposed in the literature, they tend to struggle when faced with the sheer volume of data that comes with NGS. To help address this problem we introduce the novel PSIKO software and show that it scales very well when dealing with large NGS datasets.
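    Population stratification is commonly detected with a linear projection of the genotype matrix; the sketch below uses plain principal component analysis via an SVD, which is one standard approach and not necessarily PSIKO's exact algorithm. The simulated two-population genotype matrix is an assumption for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy genotype matrix: 40 individuals x 200 SNPs (0/1/2 allele counts),
# drawn from two populations with shifted allele frequencies.
p1 = rng.uniform(0.1, 0.9, 200)
p2 = np.clip(p1 + rng.choice([-0.3, 0.3], 200), 0.05, 0.95)
G = np.vstack([rng.binomial(2, p1, size=(20, 200)),
               rng.binomial(2, p2, size=(20, 200))]).astype(float)

# Centre columns and project onto the leading principal component;
# individuals from the two populations separate along it.
X = G - G.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = X @ Vt[0]

group_a, group_b = pc1[:20], pc1[20:]
print(group_a.mean() * group_b.mean() < 0)  # True: opposite sides of origin
```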

    An efficient parallel algorithm for haplotype inference based on rule based approach and consensus methods.


    RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies

    Motivation: Phylogenies are increasingly used in all fields of medical and biological research. Moreover, because of the next-generation sequencing revolution, datasets used for conducting phylogenetic analyses grow at an unprecedented pace. RAxML (Randomized Axelerated Maximum Likelihood) is a popular program for phylogenetic analyses of large datasets under maximum likelihood. Since the last RAxML paper in 2006, it has been continuously maintained and extended to accommodate the ever-growing input datasets and to serve the needs of the user community. Results: I present some of the most notable new features and extensions of RAxML, such as a substantial extension of substitution models and supported data types, the introduction of SSE3, AVX and AVX2 vector intrinsics, techniques for reducing the memory requirements of the code and a plethora of operations for conducting post-analyses on sets of trees. In addition, an up-to-date 50-page user manual covering all new RAxML options is available.

    Estimating sample-specific regulatory networks

    Biological systems are driven by intricate interactions among the complex array of molecules that comprise the cell. Many methods have been developed to reconstruct network models of those interactions. These methods often draw on large numbers of samples with measured gene expression profiles to infer connections between genes (or gene products). The result is an aggregate network model representing a single estimate for the likelihood of each interaction, or "edge," in the network. While informative, aggregate models fail to capture the heterogeneity that is represented in any population. Here we propose a method to reverse engineer sample-specific networks from aggregate network models. We demonstrate the accuracy and applicability of our approach in several data sets, including simulated data, microarray expression data from synchronized yeast cells, and RNA-seq data collected from human lymphoblastoid cell lines. We show that these sample-specific networks can be used to study changes in network topology across time and to characterize shifts in gene regulation that may not be apparent in expression data. We believe the ability to generate sample-specific networks will greatly facilitate the application of network methods to the increasingly large, complex, and heterogeneous multi-omic data sets that are currently being generated, and ultimately support the emerging field of precision network medicine.
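    One way to extract a sample-specific network from an aggregate model is a leave-one-out linear interpolation: compare the aggregate network with and without the sample, scaled by the sample count. The sketch below applies this idea with Pearson correlation as the aggregate network; the data are random and the specific interpolation formula is an illustration of the leave-one-out strategy, not necessarily the paper's exact estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
genes, samples = 5, 30
E = rng.normal(size=(genes, samples))  # toy expression matrix

def coexpression(M):
    """Aggregate network: gene-gene Pearson correlation."""
    return np.corrcoef(M)

N = samples
agg_all = coexpression(E)
sample_nets = []
for q in range(N):
    agg_minus_q = coexpression(np.delete(E, q, axis=1))
    # Leave-one-out interpolation: amplify what sample q alone
    # contributes to each aggregate edge weight.
    sample_nets.append(N * agg_all - (N - 1) * agg_minus_q)

# Each sample gets its own gene-gene network of the same shape.
print(len(sample_nets), sample_nets[0].shape)
```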