Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs
Background
A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment.
Results
In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased.
Conclusions
The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference.
Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign
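The column-DAG construction lends itself to a compact sketch. The following illustrative Python (not the WeaveAlign implementation; the column encoding and sentinel start/end columns are assumptions) treats each sampled alignment as a list of columns, estimates column posteriors from empirical frequencies, and counts how many distinct alignments (paths) the DAG encodes:

```python
from collections import Counter, defaultdict

def build_alignment_dag(samples):
    """Build a DAG whose nodes are alignment columns.

    Each sample is a list of columns; a column is a tuple of characters,
    one per sequence, with '-' for gaps. Column probabilities come from
    empirical frequencies; edges link columns that are adjacent in at
    least one sample.
    """
    n = len(samples)
    counts = Counter()
    edges = defaultdict(set)
    for aln in samples:
        counts.update(aln)
        for a, b in zip(aln, aln[1:]):
            edges[a].add(b)
    probs = {col: c / n for col, c in counts.items()}
    return probs, edges

def count_paths(edges, start, end):
    """Count distinct alignments (start-to-end paths) via memoised DFS."""
    memo = {}
    def walk(node):
        if node == end:
            return 1
        if node not in memo:
            memo[node] = sum(walk(nxt) for nxt in edges.get(node, ()))
        return memo[node]
    return walk(start)
```

With two samples that differ at two independent positions, the DAG already encodes four alignments, illustrating how the effective sample size grows beyond the original sample set.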
Efficient Time and Space Representation of Uncertain Event Data
Process mining is a discipline which concerns the analysis of execution data
of operational processes, the extraction of models from event data, the
measurement of the conformance between event data and normative models, and the
enhancement of all aspects of processes. Most approaches assume that event
data accurately captures behavior. However, this is not realistic in many
applications: data can contain uncertainty, generated from errors in recording,
imprecise measurements, and other factors. Recently, new methods have been
developed to analyze event data containing uncertainty; these techniques
prominently rely on representing uncertain event data by means of graph-based
models explicitly capturing uncertainty. In this paper, we introduce a new
approach to efficiently calculate a graph representation of the behavior
contained in an uncertain process trace. We present our novel algorithm, prove
its asymptotic time complexity, and show experimental results that highlight
order-of-magnitude performance improvements for the behavior graph
construction. (34 pages, 16 figures, 5 tables)
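As an illustration of a graph-based representation of an uncertain trace, the following sketch (a naive cubic-time baseline, not the paper's improved algorithm) builds a behavior graph from events whose timestamps are given as uncertainty intervals: an edge connects two events when one certainly precedes the other, and transitive edges are pruned:

```python
def behavior_graph(events):
    """Behavior graph of an uncertain trace.

    events: list of (label, t_min, t_max) with interval timestamps.
    Event i certainly precedes event j iff its latest time is before
    j's earliest time; only the transitive reduction is kept.
    """
    n = len(events)
    prec = [[events[i][2] < events[j][1] for j in range(n)] for i in range(n)]
    edges = set()
    for i in range(n):
        for j in range(n):
            # keep (i, j) only if no intermediate k explains it transitively
            if prec[i][j] and not any(prec[i][k] and prec[k][j] for k in range(n)):
                edges.add((i, j))
    return edges
```

Events with overlapping intervals (here B and C) end up unordered, so both orderings remain represented in the graph.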
Learning Large-Scale Bayesian Networks with the sparsebn Package
Learning graphical models from data is an important problem with wide
applications, ranging from genomics to the social sciences. Nowadays datasets
often have upwards of thousands (sometimes tens or hundreds of thousands) of
variables and far fewer samples. To meet this challenge, we have developed a
new R package called sparsebn for learning the structure of large, sparse
graphical models with a focus on Bayesian networks. While there are many
existing software packages for this task, this package focuses on the unique
setting of learning large networks from high-dimensional data, possibly with
interventions. As such, the methods provided place a premium on scalability and
consistency in a high-dimensional setting. Furthermore, in the presence of
interventions, the methods implemented here achieve the goal of learning a
causal network from data. Additionally, the sparsebn package is fully
compatible with existing software packages for network analysis. (To appear in the Journal of Statistical Software; 39 pages, 7 figures)
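sparsebn itself is an R package built on penalised-likelihood methods. As a purely illustrative toy in the same setting (many variables, few samples, optional interventions), the sketch below declares an edge when a variable preceding another in a given causal ordering is strongly correlated with it, and drops samples in which the target variable was intervened on; all names and thresholds are assumptions, not sparsebn's method:

```python
def corr(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((v - mx) ** 2 for v in x) ** 0.5
    sy = sum((v - my) ** 2 for v in y) ** 0.5
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def learn_parents(data, order, interventions=None, threshold=0.5):
    """Toy structure learner given a causal ordering.

    data: list of rows (one value per variable index).
    interventions: per-row set of intervened variables; those rows are
    excluded when estimating that variable's parents.
    """
    interventions = interventions or [set() for _ in data]
    parents = {v: [] for v in order}
    for pos, i in enumerate(order):
        rows = [r for r, iv in zip(data, interventions) if i not in iv]
        for j in order[:pos]:
            xs = [row[j] for row in rows]
            ys = [row[i] for row in rows]
            if abs(corr(xs, ys)) > threshold:
                parents[i].append(j)
    return parents
```

The real package replaces the marginal-correlation test with a sparsity-penalised likelihood score, which is what makes it consistent in high dimensions.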
New Algorithms for Fast and Economic Assembly: Advances in Transcriptome and Genome Assembly
Great efforts have been devoted to deciphering the sequence composition of
the genomes and transcriptomes of diverse organisms. Continuing advances in
high-throughput sequencing technologies have led to a decline in associated
costs, facilitating a rapid increase in the amount of available genetic data. In
particular, genome studies have undergone a fundamental paradigm shift where
genome projects are no longer limited by sequencing costs, but rather by
computational problems associated with assembly. There is an urgent demand
for more efficient and more accurate methods. Most recently, “hybrid”
methods that integrate short- and long-read data have been devised to address
this need. LazyB is a new, low-cost hybrid genome assembler. It starts from a
bipartite overlap graph between long reads and restrictively filtered short-read
unitigs. This graph is translated into a long-read overlap graph. By design,
unitigs are both unique and almost free of assembly errors. As a consequence,
only a few spurious overlaps are introduced into the graph. Instead of the more
conventional approach of removing tips, bubbles, and other local features,
LazyB extracts subgraphs whose global properties approach a disjoint union of
paths in multiple steps, utilizing properties of proper interval graphs. A
prototype implementation of LazyB, entirely written in Python, not only yields
significantly more accurate assemblies of the yeast, fruit fly, and human
genomes compared to state-of-the-art pipelines, but also requires much less
computational effort. An optimized C++ implementation dubbed MuCHSALSA
further significantly reduces resource demands.
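The bipartite-to-overlap translation can be sketched as a graph projection. The code below (a simplification; LazyB additionally checks anchor order and orientation before accepting an overlap) connects two long reads whenever they share an anchoring unitig:

```python
from collections import defaultdict
from itertools import combinations

def long_read_overlap_graph(bipartite_edges):
    """Project a bipartite (long read, unitig) graph onto long reads.

    Two long reads are deemed overlapping if they anchor to a common
    short-read unitig. Returns undirected overlap edges as sorted pairs.
    """
    by_unitig = defaultdict(set)
    for read, unitig in bipartite_edges:
        by_unitig[unitig].add(read)
    overlaps = set()
    for reads in by_unitig.values():
        for a, b in combinations(sorted(reads), 2):
            overlaps.add((a, b))
    return overlaps
```

Because unitigs are unique and nearly error-free, this projection introduces few spurious edges, which is what makes the subsequent global path extraction tractable.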
Advances in RNA-seq have facilitated tremendous insights into the role of
both coding and non-coding transcripts. Yet, the complete and accurate
annotation of the transcriptomes of even model organisms has remained elusive.
RNA-seq produces reads significantly shorter than the average distance
between related splice events, and exhibits high noise levels and other biases.
The computational reconstruction of transcripts therefore remains a critical bottleneck.
Ryūtō implements an extension of common splice graphs facilitating the integration
of reads spanning multiple splice sites and paired-end reads bridging distant
transcript parts. The decomposition of read coverage patterns is modeled as a
minimum-cost flow problem. Using phasing information from multi-splice and
paired-end reads, nodes with uncertain connections are decomposed step-wise
via Linear Programming.
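Once a minimum-cost flow has been computed on the splice graph, transcripts correspond to a decomposition of that flow into weighted source-to-sink paths. The sketch below shows only this final step, using a greedy heaviest-edge decomposition on a DAG; Ryūtō's actual decomposition additionally uses phasing information and Linear Programming:

```python
from collections import defaultdict

def decompose_flow(flow_units, source, sink):
    """Greedily decompose an s-t flow into weighted paths.

    flow_units: dict mapping edge (u, v) to flow units; graph must be a DAG.
    Repeatedly follows the heaviest positive-flow edge from source to sink,
    subtracts the path's bottleneck flow, and records (path, weight).
    """
    flow = dict(flow_units)
    out = defaultdict(list)
    for u, v in flow:
        out[u].append(v)
    paths = []
    while True:
        path, u = [source], source
        while u != sink:
            nxts = [v for v in out[u] if flow[(u, v)] > 0]
            if not nxts:
                break
            nxt = max(nxts, key=lambda v: flow[(u, v)])  # heaviest next edge
            path.append(nxt)
            u = nxt
        if u != sink:
            break  # no remaining positive-flow s-t path
        w = min(flow[(a, b)] for a, b in zip(path, path[1:]))
        for a, b in zip(path, path[1:]):
            flow[(a, b)] -= w
        paths.append((path, w))
    return paths
```

Each recovered path is a candidate transcript and its weight an abundance estimate; ambiguous nodes are exactly where the phasing-based decomposition improves on this greedy baseline.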
Ryūtō's performance compares favorably with
state-of-the-art methods on both simulated and real-life datasets. Despite
ongoing research and our own contributions, progress on traditional single
sample assembly has brought no major breakthrough. Multi-sample RNA-Seq
experiments provide more information which, however, is challenging to utilize
due to the large amount of accumulating errors. An extension to Ryūtō
enables the reconstruction of consensus transcriptomes from multiple RNA-seq
data sets, incorporating consensus calling at low-level features. Benchmarks
show stable improvements already at 3 replicates.
Ryūtō outperforms competing approaches, providing a better and user-adjustable
sensitivity-precision trade-off. Ryūtō consistently improves assembly on
replicates; the improvement also holds when mixing conditions or time series,
and for differential expression analysis. Ryūtō's approach to guided assembly is
equally unique. It allows users to adjust results based on the quality of the
guide, even for multi-sample assembly.

1 Preface
1.1 Assembly: A vast and fast evolving field
1.2 Structure of this Work
1.3 Available
2 Introduction
2.1 Mathematical Background
2.2 High-Throughput Sequencing
2.3 Assembly
2.4 Transcriptome Expression
3 From LazyB to MuCHSALSA - Fast and Cheap Genome Assembly
3.1 Background
3.2 Strategy
3.3 Data preprocessing
3.4 Processing of the overlap graph
3.5 Post Processing of the Path Decomposition
3.6 Benchmarking
3.7 MuCHSALSA – Moving towards the future
4 Ryūtō - Versatile, Fast, and Effective Transcript Assembly
4.1 Background
4.2 Strategy
4.3 The Ryūtō core algorithm
4.4 Improved Multi-sample transcript assembly with Ryūtō
5 Conclusion & Future Work
5.1 Discussion and Outlook
5.2 Summary and Conclusion
Metabolic Network Alignments and their Applications
The accumulation of high-throughput genomic and proteomic data allows for the reconstruction of increasingly large and complex metabolic networks. In order to analyze the accumulated data and the reconstructed networks, it is critical to identify network patterns and evolutionary relations between metabolic networks. However, even finding similar networks is computationally challenging. This dissertation addresses these challenges with discrete optimization and the corresponding algorithmic techniques. Based on the properties of gene duplication and function sharing in biological networks, we have formulated the network alignment problem, which asks for an optimal vertex-to-vertex mapping allowing path contraction, vertex deletion, and vertex insertion. We have proposed the first polynomial-time algorithm for aligning an acyclic metabolic pattern pathway with an arbitrary metabolic network. We have also proposed a polynomial-time algorithm for patterns with small treewidth and implemented it for series-parallel patterns, which are commonly found among metabolic networks. We have developed a metabolic network alignment tool for free public use. We have performed pairwise mapping of all pathways among five organisms and found a set of statistically significant pathway similarities. We have also applied network alignment to identifying inconsistencies, inferring missing enzymes, and finding potential candidates.
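The simplest instance of the alignment problem, a linear pathway against another linear pathway with deletions and insertions, reduces to a classic dynamic program. The sketch below is illustrative only; the dissertation's algorithms handle acyclic patterns against arbitrary networks and additionally allow path contraction:

```python
def align_pathways(p, q, match=0, mismatch=1, indel=1):
    """Needleman-Wunsch-style DP over two linear enzyme pathways.

    p, q: lists of enzyme labels (e.g. EC numbers).
    Returns the minimum alignment cost under unit mismatch/indel costs.
    """
    m, n = len(p), len(q)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i * indel          # delete all of p[:i]
    for j in range(n + 1):
        D[0][j] = j * indel          # insert all of q[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if p[i - 1] == q[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub,   # match/mismatch
                          D[i - 1][j] + indel,     # deletion
                          D[i][j - 1] + indel)     # insertion
    return D[m][n]
```

In the general problem, substitution costs would reflect enzyme similarity (e.g. shared EC prefix) rather than exact label equality.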
NovoGraph: Human genome graph construction from multiple long-read de novo assemblies [version 2; referees: 2 approved]
Genome graphs are emerging as an important novel approach to the analysis of high-throughput human sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables de novo assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based de novo assembly, including large structural variants and divergent haplotypes. Here we present NovoGraph, a method for the construction of a human genome graph directly from a set of de novo assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and creates a graph by merging the input sequences at positions that are both homologous and sequence-identical. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from de novo assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped
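The merge rule, that identical bases at homologous positions (here simplified to identical MSA columns) collapse into one node, can be sketched directly. The following toy turns an MSA into a sequence graph; NovoGraph itself works genome-wide and emits VCF rather than an in-memory graph:

```python
def msa_to_graph(msa):
    """Collapse an MSA into a sequence graph.

    msa: list of equal-length rows over {A, C, G, T, -}. At each column,
    rows carrying the same base share one node; gaps are skipped, so
    edges can jump over gapped columns.
    """
    nodes, edges = set(), set()
    for seq in msa:
        prev = None
        for col, base in enumerate(seq):
            if base == '-':
                continue
            node = (col, base)  # same base in same column -> one shared node
            nodes.add(node)
            if prev is not None:
                edges.add((prev, node))
            prev = node
    return nodes, edges
```

Positions where the inputs agree yield a single shared node, while variant sites branch, which is exactly what makes the graph mappable across divergent haplotypes.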
Fast Statistical Alignment
We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/
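Sequence annealing can be illustrated in the two-sequence case: candidate residue pairings are accepted greedily in order of decreasing posterior probability, as long as they remain mutually consistent. The sketch below is an intentionally simplified version of the idea (FSA's annealing operates on many sequences via a partial order, and the posteriors come from its pair HMMs):

```python
def anneal_pairwise(posteriors):
    """Greedy annealing for two sequences.

    posteriors: list of (i, j, p) meaning residue i of sequence 1 pairs
    with residue j of sequence 2 with posterior p. Pairings are accepted
    in decreasing order of p if consistent with all accepted pairings,
    i.e. strictly increasing in both coordinates.
    """
    accepted = []
    for i, j, p in sorted(posteriors, key=lambda t: -t[2]):
        ok = all((i < a) == (j < b) and i != a and j != b
                 for a, b, _ in accepted)
        if ok:
            accepted.append((i, j, p))
    return sorted((i, j) for i, j, _ in accepted)
```

Stopping the annealing early (accepting only high-posterior pairings) is what yields conservative, low-false-positive alignments of the kind the centroid approach favors.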