A Graph Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction
Reconstructing components of a genomic mixture from data obtained by means of
DNA sequencing is a challenging problem encountered in a variety of
applications including single individual haplotyping and studies of viral
communities. High-throughput DNA sequencing platforms oversample mixture
components to provide massive amounts of reads whose relative positions can be
determined by mapping the reads to a known reference genome; assembly of the
components, however, requires discovery of the reads' origin -- an NP-hard
problem that the existing methods struggle to solve with the required level of
accuracy. In this paper, we present a learning framework based on a graph
auto-encoder designed to exploit structural properties of sequencing data. The
algorithm is a neural network that, in effect, learns to ignore sequencing
errors and infers the posterior probabilities of the origins of sequencing
reads. Mixture components are then reconstructed by finding consensus of the
reads determined to originate from the same genomic component. Results on
realistic synthetic as well as experimental data demonstrate that the proposed
framework reliably assembles haplotypes and reconstructs viral communities,
often significantly outperforming state-of-the-art techniques.
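The final consensus step described above can be sketched in a few lines. This is a simplified illustration, not the paper's pipeline: reads already assigned to the same genomic component are represented as hypothetical (start, sequence) pairs aligned to a reference, and a per-column majority vote yields the component sequence.

```python
from collections import Counter

def consensus(reads, length):
    """Majority vote at each reference position over aligned reads.

    reads: list of (start, sequence) pairs with 0-based start positions.
    Positions covered by no read are reported as 'N'.
    """
    columns = [Counter() for _ in range(length)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)

# Reads inferred to come from one component; the lone 'G' at the
# fourth column is a sequencing error and is outvoted.
component = [(0, "ACGT"), (1, "CGTA"), (1, "CGGA")]
print(consensus(component, 5))  # ACGTA
```

In practice the read-to-component assignment is the hard part (the NP-hard inference the graph auto-encoder addresses); once assignments are fixed, consensus calling is this straightforward.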
Nephele: genotyping via complete composition vectors and MapReduce
Background: Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism, mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for the rapidly growing sequence datasets being generated by next-generation DNA sequencing machines, because their computational complexity increases quickly with the number of sequences.
Results: Nephele is a suite of tools that uses the complete composition vector algorithm to represent each sequence in the dataset as a vector derived from its constituent k-mers, bypassing the need for multiple sequence alignment, and affinity propagation clustering to group the sequences into genotypes based on a distance measure over the vectors. Our methods produce results that correlate well with expert-defined clades or genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware. Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution across multiple compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 hours.
Conclusions: We conclude that using Nephele can substantially decrease the processing time required for generating genotype trees of tens to hundreds of organisms at genome-scale sequence coverage.
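The k-mer vectorisation idea can be illustrated with a minimal sketch. Note the simplification: plain k-mer frequencies are used here, whereas the full complete composition vector algorithm additionally subtracts a Markov-model background term, which is omitted.

```python
from itertools import product
import math

def kmer_vector(seq, k=3):
    """Normalised k-mer frequency vector over the DNA alphabet.

    Simplified stand-in for a complete composition vector: the
    Markov-background correction of the full CCV is omitted.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    total = max(1, len(seq) - k + 1)
    return [counts[km] / total for km in kmers]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Similar sequences end up close in k-mer space, dissimilar ones far apart.
a = kmer_vector("ACGTACGTACGT")
b = kmer_vector("ACGTACGTACGA")
c = kmer_vector("TTTTTTTTTTTT")
assert euclidean(a, b) < euclidean(a, c)
```

A clustering step (for example scikit-learn's `AffinityPropagation`) could then group the resulting vectors into genotypes, as Nephele does over its distance measure.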
Simultaneous identification of specifically interacting paralogs and inter-protein contacts by Direct-Coupling Analysis
Understanding protein-protein interactions is central to our understanding of
almost all complex biological processes. Computational tools exploiting rapidly
growing genomic databases to characterize protein-protein interactions are
urgently needed. Such methods should connect multiple scales from evolutionary
conserved interactions between families of homologous proteins, over the
identification of specifically interacting proteins in the case of multiple
paralogs inside a species, down to the prediction of residues being in physical
contact across interaction interfaces. Statistical inference methods detecting
residue-residue coevolution have recently triggered considerable progress in
using sequence data for quaternary protein structure prediction; they require,
however, large joint alignments of homologous protein pairs known to interact.
The generation of such alignments is a complex computational task on its own;
application of coevolutionary modeling has in turn been restricted to proteins
without paralogs, or to bacterial systems with the corresponding coding genes
being co-localized in operons. Here we show that the Direct-Coupling Analysis
of residue coevolution can be extended to connect the different scales, and
simultaneously to match interacting paralogs, to identify inter-protein
residue-residue contacts and to discriminate interacting from noninteracting
families in a multiprotein system. Our results extend the potential
applications of coevolutionary analysis far beyond cases treatable so far.
(Main text: 19 pages; Supplementary Information: 16 pages.)
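The core idea behind Direct-Coupling Analysis, separating direct interactions from indirect correlations, can be shown in a toy mean-field sketch. This is an illustration on a binary alignment, not the multi-state Potts model used in practice: direct couplings are read off the inverse of the empirical covariance matrix.

```python
import numpy as np

# Toy binary "alignment": rows = sequences, columns = positions.
# Column 1 is a noisy copy of column 0 (a direct interaction);
# column 2 is independent of both.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=500)
x1 = x0 ^ (rng.random(500) < 0.1)   # flip ~10% of entries
x2 = rng.integers(0, 2, size=500)
msa = np.column_stack([x0, x1, x2]).astype(float)

# Mean-field inverse Ising: couplings are (minus) the off-diagonal
# entries of the inverse covariance matrix; a small ridge term is
# added for numerical stability.
C = np.cov(msa, rowvar=False) + 1e-3 * np.eye(3)
J = -np.linalg.inv(C)
np.fill_diagonal(J, 0.0)
# |J[0, 1]| dominates the other couplings.
```

The real method additionally handles 21-state amino-acid alphabets, pseudocounts, and the paralog-matching layer described in the abstract; only the direct-vs-indirect separation is shown here.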
Inverse Ising inference with correlated samples
Correlations between two variables of a high-dimensional system can be
indicative of an underlying interaction, but can also result from indirect
effects. Inverse Ising inference is a method to distinguish one from the other.
Essentially, the parameters of the least constrained statistical model are
learned from the observed correlations such that direct interactions can be
separated from indirect correlations. Among many other applications, this
approach has been helpful for protein structure prediction, because residues
which interact in the 3D structure often show correlated substitutions in a
multiple sequence alignment. In this context, samples used for inference are
not independent but share an evolutionary history on a phylogenetic tree. Here,
we discuss the effects of correlations between samples on global inference.
Such correlations could arise due to phylogeny but also via other slow
dynamical processes. We present a simple analytical model to address the
resulting inference biases, and develop an exact method accounting for
background correlations in alignment data by combining phylogenetic modeling
with an adaptive cluster expansion algorithm. We find that popular reweighting
schemes are only marginally effective at removing phylogenetic bias, suggest a
rescaling strategy that yields better results, and provide evidence that our
conclusions carry over to the frequently used mean-field approach to the
inverse Ising problem. (18 pages, 6 figures; accepted at New J. Phys.)
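The "popular reweighting schemes" referred to above typically downweight each sequence by the number of similar sequences in the alignment. A minimal sketch of that standard scheme (the identity threshold `theta` is conventionally around 0.8; the values here are illustrative):

```python
def sequence_weights(alignment, theta=0.8):
    """Similarity-based reweighting: each sequence gets weight 1/n_i,
    where n_i counts sequences (itself included) whose fractional
    identity to it is at least theta."""
    L = len(alignment[0])
    weights = []
    for s in alignment:
        n_similar = sum(
            sum(a == b for a, b in zip(s, t)) / L >= theta
            for t in alignment
        )
        weights.append(1.0 / n_similar)
    return weights

# Two near-identical sequences share their weight; the outlier keeps 1.
aln = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC"]
print(sequence_weights(aln))  # [0.5, 0.5, 1.0]
```

As the abstract notes, this kind of reweighting is only marginally effective against phylogenetic bias, which motivates the rescaling strategy the paper proposes.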
Novel Algorithms and Methodology to Help Unravel Secrets that Next Generation Sequencing Data Can Tell
The genome of an organism is its complete set of DNA nucleotides, spanning
all of its genes and also of its non-coding regions. It contains most of
the information necessary to build and maintain an organism. It is therefore
no surprise that sequencing the genome provides an invaluable tool for
the scientific study of an organism. Via the inference of an evolutionary
(phylogenetic) tree, DNA sequences can be used to reconstruct the evolutionary
history of a set of species. DNA sequences, or genotype data, have
also proven useful for predicting an organism's phenotype (i.e. its observed
traits) from its genotype. This is the objective of association studies.
While methods for finding the DNA sequence of an organism have existed
for decades, the recent advent of Next Generation Sequencing (NGS) has
meant that the availability of such data has increased to such an extent
that the computational challenges that now form an integral part of biological
studies can no longer be ignored. By focusing on phylogenetics
and Genome-Wide Association Studies (GWAS), this thesis aims to help
address some of these challenges. As a consequence this thesis is in two
parts with the first one centring on phylogenetics and the second one on
GWAS.
In the first part, we present theoretical insights for reconstructing phylogenetic
trees from incomplete distances. This problem is important in the
context of NGS data, as incomplete pairwise distances between organisms
occur frequently with such input, and ignoring taxa for which information
is missing can introduce undesirable bias. In the second part we focus on
the problem of inferring population stratification between individuals in a
dataset due to reproductive isolation. While powerful methods for doing
this have been proposed in the literature, they tend to struggle when faced
with the sheer volume of data that comes with NGS. To help address this
problem we introduce the novel PSIKO software and show that it scales
very well when dealing with large NGS datasets.
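Population stratification of the kind discussed above is often visualised through principal component analysis of a genotype matrix. The sketch below shows that generic PCA view on a toy dataset; it is not PSIKO's algorithm, and all frequencies and sizes are illustrative.

```python
import numpy as np

# Toy genotype matrix: rows = individuals, columns = SNPs, entries 0/1/2
# (minor-allele counts). Two populations with different allele frequencies.
rng = np.random.default_rng(1)
freq_a = rng.uniform(0.1, 0.4, size=50)
freq_b = rng.uniform(0.6, 0.9, size=50)
pop_a = rng.binomial(2, freq_a, size=(20, 50))
pop_b = rng.binomial(2, freq_b, size=(20, 50))
G = np.vstack([pop_a, pop_b]).astype(float)

# Centre each SNP and project onto the leading principal component;
# the sign of the projection separates the two populations.
G -= G.mean(axis=0)
_, _, vt = np.linalg.svd(G, full_matrices=False)
pc1 = G @ vt[0]
labels = pc1 > 0  # one population on each side of zero
```

The scaling problem the thesis targets arises because, on real NGS data, the genotype matrix has millions of columns rather than fifty.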
RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies
Motivation: Phylogenies are increasingly used in all fields of medical and biological research. Moreover, because of the next-generation sequencing revolution, datasets used for conducting phylogenetic analyses grow at an unprecedented pace. RAxML (Randomized Axelerated Maximum Likelihood) is a popular program for phylogenetic analyses of large datasets under maximum likelihood. Since the last RAxML paper in 2006, it has been continuously maintained and extended to accommodate ever-growing input datasets and to serve the needs of the user community.
Results: I present some of the most notable new features and extensions of RAxML, such as a substantial extension of substitution models and supported data types, the introduction of SSE3, AVX and AVX2 vector intrinsics, techniques for reducing the memory requirements of the code, and a plethora of operations for conducting post-analyses on sets of trees. In addition, an up-to-date 50-page user manual covering all new RAxML options is available.
Estimating sample-specific regulatory networks
Biological systems are driven by intricate interactions among the complex
array of molecules that comprise the cell. Many methods have been developed to
reconstruct network models of those interactions. These methods often draw on
large numbers of samples with measured gene expression profiles to infer
connections between genes (or gene products). The result is an aggregate
network model representing a single estimate for the likelihood of each
interaction, or "edge," in the network. While informative, aggregate models
fail to capture the heterogeneity that is represented in any population. Here
we propose a method to reverse engineer sample-specific networks from aggregate
network models. We demonstrate the accuracy and applicability of our approach
in several data sets, including simulated data, microarray expression data from
synchronized yeast cells, and RNA-seq data collected from human lymphoblastoid
cell lines. We show that these sample-specific networks can be used to study
changes in network topology across time and to characterize shifts in gene
regulation that may not be apparent in expression data. We believe the ability
to generate sample-specific networks will greatly facilitate the application of
network methods to the increasingly large, complex, and heterogeneous
multi-omic data sets that are currently being generated, and ultimately support
the emerging field of precision network medicine.
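One way to realise the reverse-engineering idea described above is a leave-one-out linear interpolation around the aggregate network. The sketch below uses Pearson-correlation networks as the aggregate model; it illustrates the interpolation, not the paper's exact estimator.

```python
import numpy as np

def corr_net(X):
    """Aggregate network: gene-gene Pearson correlation matrix
    (rows = samples, columns = genes)."""
    return np.corrcoef(X, rowvar=False)

def sample_specific_nets(X):
    """Leave-one-out linear interpolation around the aggregate network:
    the network for sample s is N*net(all) - (N-1)*net(all without s),
    so edges whose strength depends heavily on sample s stand out."""
    N = X.shape[0]
    agg = corr_net(X)
    return [N * agg - (N - 1) * corr_net(np.delete(X, s, axis=0))
            for s in range(N)]

# Toy expression matrix: 30 samples x 5 genes.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
nets = sample_specific_nets(X)  # one 5x5 network per sample
```

Each sample-specific matrix remains symmetric with unit diagonal, and comparing them across samples is what enables the topology-over-time analyses the abstract describes.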