Finding conserved patterns in biological sequences, networks and genomes
Biological patterns are widely used for identifying biologically interesting regions
within macromolecules, classifying biological objects, predicting functions and studying
evolution. Good pattern finding algorithms will help biologists to formulate and
validate hypotheses in an attempt to obtain important insights into the complex
mechanisms of living things.
In this dissertation, we aim to improve and develop algorithms for five biological
pattern finding problems. For the multiple sequence alignment problem, we propose
an alternative formulation in which a final alignment is obtained by preserving pairwise
alignments specified by edges of a given tree. In contrast with traditional NP-hard
formulations, our preserving alignment formulation can be solved in polynomial
time without using a heuristic, while having very good accuracy.
For the path matching problem, we take advantage of the linearity of the query
path to reduce the problem to finding a longest weighted path in a directed acyclic
graph. We can find the k top-scoring paths in a network for the query path in
polynomial time. As many biological pathways are not linear, our graph matching
approach allows a non-linear graph query to be given. Our graph matching formulation
overcomes the common weakness of previous approaches that there is no
guarantee on the quality of the results.
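
The reduction of path matching to a longest weighted path in a directed acyclic graph can be illustrated with a short sketch. The abstract does not say how the DAG is built from the query path and the network, so the code below assumes that the DAG and its edge weights are already given; the function name `longest_weighted_path` and the edge-list input are hypothetical. It shows only the single best path, not the top-k variant.

```python
from collections import defaultdict, deque


def longest_weighted_path(num_nodes, edges):
    """Return (score, path) for a maximum-weight path in a DAG.

    Assumed input: nodes are 0..num_nodes-1 and edges is an iterable of
    (u, v, weight) triples. A path may start and end at any node.
    """
    adj = defaultdict(list)
    indeg = [0] * num_nodes
    for u, v, w in edges:
        adj[u].append((v, w))
        indeg[v] += 1

    # Topological order via Kahn's algorithm.
    queue = deque(i for i in range(num_nodes) if indeg[i] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != num_nodes:
        raise ValueError("graph contains a cycle")

    best = [0] * num_nodes      # best score of a path ending at each node
    prev = [None] * num_nodes   # predecessor on that best path
    for u in order:
        for v, w in adj[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                prev[v] = u

    # Trace back from the best-scoring end node.
    end = max(range(num_nodes), key=best.__getitem__)
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = prev[node]
    return best[end], path[::-1]
```

Because every node is relaxed once in topological order, the running time is linear in the number of nodes and edges, which is what makes the DAG formulation attractive.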
For the gene cluster finding problem, we investigate a formulation based on constraining the overall size of a cluster and develop statistical significance estimates that
allow direct comparisons of clusters of different sizes. We explore both a restricted
version which requires that orthologous genes are strictly ordered within each cluster,
and the unrestricted problem that allows paralogous genes within a genome and clusters
that may not appear in every genome. We solve the first problem in polynomial
time and develop practical exact algorithms for the second one.
For the gene cluster querying problem, we propose an efficient query-based
approach for investigating clustering of related genes across multiple
genomes for a given gene cluster. By analyzing gene clustering in 400 bacterial
genomes, we show that our algorithm is efficient enough to study gene clusters across
hundreds of genomes.
Finding genomic differences from whole-genome assemblies using SyRI
Genomic differences can range from single nucleotide differences (SNPs) to large complex structural rearrangements. Current methods can typically annotate sequence differences like SNPs and large indels accurately but do not unravel the full complexity of structural rearrangements that include inversions, translocations, and duplications. Structural rearrangements involve changes in location, orientation, or copy-number between highly similar sequences and have been reported to be associated with several biological differences between organisms. However, they remain little studied with sequencing technologies because identifying them accurately is still challenging.
Here I present SyRI, a novel computational method for genome-wide identification of structural differences using the pairwise comparison of whole-genome chromosome-level assemblies. SyRI uses a unique approach where it first identifies all syntenic (structurally conserved) regions between two genomes. Since all non-syntenic regions are structural rearrangements by definition, this transforms the difficult problem of rearrangement identification into a comparatively easier problem of rearrangement classification. SyRI analyses the location, orientation, and copy-number of alignments between rearranged regions and selects alignments that best represent the putative rearrangements and result in the highest total alignment score between the genomes. Next, SyRI searches for sequence differences and distinguishes those residing in syntenic regions from those residing in rearranged regions. This distinction is important, as rearranged regions (and sequence differences within them) do not follow the Mendelian law of segregation and are therefore inherited differently compared to syntenic regions. Using SyRI, I successfully identified rearrangements in human, A. thaliana, yeast, fruit fly, and maize genomes. Further, I also experimentally validated 92% (108/117) of the predicted translocations in A. thaliana using a genetic approach.
Economic Genome Assembly from Low Coverage Illumina and Nanopore Data
Ongoing developments in genome sequencing have caused a fundamental paradigm shift in the field in recent years. With ever lower sequencing costs, projects are no longer limited by available raw data, but rather by computational demands. The high complexity of eukaryotic genomes in combination with increasing data sizes creates unique demands on methods to assemble full genomes. We describe a new approach to assemble genomes from a combination of low-coverage short and long reads. LazyB starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs, which is then reduced to a long-read overlap graph G. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts, step by step, subgraphs whose global properties approach a disjoint union of paths. First, a consistently oriented subgraph is extracted, which in a second step is reduced to a directed acyclic graph. In the next step, properties of proper interval graphs are used to extract contigs as maximum weight paths. These are translated into genomic sequences only in the final step. A prototype implementation of LazyB, entirely written in Python, not only yields significantly more accurate assemblies of the yeast and fruit fly genomes compared to state-of-the-art pipelines but also requires much less computational effort. Our findings demonstrate a new low-cost method that enables the assembly of even large genomes with low computational effort.
Validating Paired-End Read Alignments in Sequence Graphs
Graph-based non-linear reference structures such as variation graphs and colored de Bruijn graphs enable incorporation of full genomic diversity within a population. However, transitioning from a simple string-based reference to graphs requires addressing many computational challenges, one of which concerns accurately mapping sequencing read sets to graphs. Paired-end Illumina sequencing is a commonly used sequencing platform in genomics, where the paired-end distance constraints allow disambiguation of repeats. Many recent works have explored provably good index-based and alignment-based strategies for mapping individual reads to graphs. However, validating distance constraints efficiently over graphs is not trivial, and existing sequence-to-graph mappers rely on heuristics. We introduce a mathematical formulation of the problem, and provide a new algorithm to solve it exactly. We take advantage of the high sparsity of reference graphs, and use sparse matrix-matrix multiplications (SpGEMM) to build an index which can be queried efficiently by a mapping algorithm for validating the distance constraints. Effectiveness of the algorithm is demonstrated using real reference graphs, including a human MHC variation graph, and a pan-genome de Bruijn graph built using genomes of 20 B. anthracis strains. While the one-time indexing time can vary from a few minutes to a few hours using our algorithm, answering a million distance queries takes less than a second.
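
The idea of building a distance index from sparse matrix products can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: it assumes every vertex has unit length, measures distance as the number of edges on a walk, and stores the sparse boolean matrix as a plain dictionary of sets instead of using an optimized SpGEMM library. The names `spgemm_bool`, `build_distance_index`, and `valid_pair` are hypothetical.

```python
def spgemm_bool(a, b):
    """Boolean product of two sparse matrices stored as {row: set(columns)}."""
    out = {}
    for i, cols in a.items():
        reach = set()
        for k in cols:
            reach |= b.get(k, set())
        if reach:
            out[i] = reach
    return out


def build_distance_index(adj, d_min, d_max):
    """index[u] = vertices reachable from u by a walk of d_min..d_max edges.

    adj is the adjacency matrix as {vertex: set(successors)} (assumed input).
    """
    nodes = set(adj) | {v for cols in adj.values() for v in cols}
    power = {v: {v} for v in nodes}      # A^0, the identity matrix
    index = {v: set() for v in nodes}
    for d in range(d_max + 1):
        if d >= d_min:
            # Accumulate A^d into the index.
            for u, cols in power.items():
                index[u] |= cols
        if d < d_max:
            power = spgemm_bool(power, adj)   # A^(d+1)
    return index


def valid_pair(index, u, v):
    """Constant-time check of the distance constraint for one read pair."""
    return v in index.get(u, set())
```

The split between an expensive one-time `build_distance_index` and a cheap `valid_pair` lookup mirrors the indexing versus query cost reported in the abstract.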
Detecting Superbubbles in Assembly Graphs
We introduce a new concept of a subgraph class called a superbubble for
analyzing assembly graphs, and propose an efficient algorithm for detecting it.
Most assembly algorithms utilize assembly graphs like the de Bruijn graph or
the overlap graph constructed from reads. From these graphs, many assembly
algorithms first detect simple local graph structures (motifs), such as tips
and bubbles, mainly to find sequencing errors. These motifs are easy to detect,
but they are sometimes too simple to deal with more complex errors. The
superbubble is an extension of the bubble, which is also important for
analyzing assembly graphs. Though superbubbles are much more complex than
ordinary bubbles, we show that they can be efficiently enumerated. We propose
an average-case linear time algorithm (i.e., O(n+m) for a graph with n vertices
and m edges) for graphs with a reasonable model, though the worst-case time
complexity of our algorithm is quadratic (i.e., O(n(n+m))). Moreover, the
algorithm is practically very fast: Our experiments show that our algorithm
runs in reasonable time with a single CPU core even against a very large graph
of a whole human genome.
Comment: Peer-reviewed and presented as part of the 13th Workshop on
Algorithms in Bioinformatics (WABI 2013)
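
The abstract does not define a superbubble or spell out the algorithm, so the following is only a sketch of one way per-entrance detection can work. It assumes the usual informal notion: an entrance s and exit t such that everything reachable from s before t can reach t, nothing from outside enters the region, and the region is acyclic. Running the per-entrance check from every vertex gives the quadratic worst case mentioned above. The function names are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Build child and parent maps from a list of directed edges (u, v)."""
    children, parents = defaultdict(set), defaultdict(set)
    for u, v in edges:
        children[u].add(v)
        parents[v].add(u)
    return children, parents


def superbubble_exit(s, children, parents):
    """Return the exit of the superbubble entered at s, or None."""
    seen, visited, stack = set(), set(), [s]
    while stack:
        v = stack.pop()
        visited.add(v)
        seen.discard(v)
        if not children[v]:
            return None                  # dead end (tip) inside the region
        for u in children[v]:
            if u == s:
                return None              # cycle back to the entrance
            seen.add(u)
            if parents[u] <= visited:    # all parents already visited
                stack.append(u)
        # Exactly one pending vertex and nothing else seen: it is the exit.
        if len(stack) == 1 and seen == {stack[0]}:
            t = stack.pop()
            return None if s in children[t] else t
    return None


def find_superbubbles(edges):
    """Check every vertex as a candidate entrance."""
    children, parents = build_graph(edges)
    vertices = sorted(set(children) | set(parents))
    result = []
    for s in vertices:
        t = superbubble_exit(s, children, parents)
        if t is not None:
            result.append((s, t))
    return result
```

A vertex is expanded only after all of its parents have been visited, so an edge entering the region from outside blocks the search and the candidate entrance is rejected. Note that under this check a lone edge u -> v, with no other edge leaving u or entering v, is also reported as a trivial superbubble.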
New Algorithms for Fast and Economic Assembly: Advances in Transcriptome and Genome Assembly
Great efforts have been devoted to decipher the sequence composition of
the genomes and transcriptomes of diverse organisms. Continuing advances in
high-throughput sequencing technologies have led to a decline in associated
costs, facilitating a rapid increase in the amount of available genetic data. In
particular genome studies have undergone a fundamental paradigm shift where
genome projects are no longer limited by sequencing costs, but rather by
computational problems associated with assembly. There is an urgent demand
for more efficient and more accurate methods. Most recently, “hybrid”
methods that integrate short- and long-read data have been devised to address
this need. LazyB is a new, low-cost hybrid genome assembler. It starts from a
bipartite overlap graph between long reads and restrictively filtered short-read
unitigs. This graph is translated into a long-read overlap graph. By design,
unitigs are both unique and almost free of assembly errors. As a consequence,
only few spurious overlaps are introduced into the graph. Instead of the more
conventional approach of removing tips, bubbles, and other local features,
LazyB extracts subgraphs whose global properties approach a disjoint union of
paths in multiple steps, utilizing properties of proper interval graphs. A
prototype implementation of LazyB, entirely written in Python, not only yields
significantly more accurate assemblies of the yeast, fruit fly, and human
genomes compared to state-of-the-art pipelines, but also requires much less
computational effort. An optimized C++ implementation dubbed MuCHSALSA
further significantly reduces resource demands.
Advances in RNA-seq have facilitated tremendous insights into the role of
both coding and non-coding transcripts. Yet, the complete and accurate
annotation of the transcriptomes of even model organisms has remained elusive.
RNA-seq produces reads significantly shorter than the average distance
between related splice events and presents high noise levels and other biases.
The computational reconstruction remains a critical bottleneck.
Ryūtō implements an extension of common splice graphs facilitating the integration
of reads spanning multiple splice sites and paired-end reads bridging distant
transcript parts. The decomposition of read coverage patterns is modeled as a
minimum-cost flow problem. Using phasing information from multi-splice and
paired-end reads, nodes with uncertain connections are decomposed step-wise
via Linear Programming.
Ryūtō's performance compares favorably with
state-of-the-art methods on both simulated and real-life datasets. Despite
ongoing research and our own contributions, progress on traditional single
sample assembly has brought no major breakthrough. Multi-sample RNA-Seq
experiments provide more information which, however, is challenging to utilize
due to the large amount of accumulating errors. An extension to Ryūtō
enables the reconstruction of consensus transcriptomes from multiple RNA-seq
data sets, incorporating consensus calling at low level features. Benchmarks
show stable improvements already at 3 replicates.
Ryūtō outperforms competing approaches, providing a better and user-adjustable
sensitivity-precision trade-off. Ryūtō consistently improves assembly on
replicates, demonstrable also when mixing conditions or time series and for
differential expression analysis. Ryūtō's approach towards guided assembly is
equally unique. It allows users to adjust results based on the quality of the
guide, even for multi-sample assembly.
1 Preface
1.1 Assembly: A vast and fast evolving field
1.2 Structure of this Work
1.3 Available
2 Introduction
2.1 Mathematical Background
2.2 High-Throughput Sequencing
2.3 Assembly
2.4 Transcriptome Expression
3 From LazyB to MuCHSALSA - Fast and Cheap Genome Assembly
3.1 Background
3.2 Strategy
3.3 Data preprocessing
3.4 Processing of the overlap graph
3.5 Post Processing of the Path Decomposition
3.6 Benchmarking
3.7 MuCHSALSA – Moving towards the future
4 Ryūtō - Versatile, Fast, and Effective Transcript Assembly
4.1 Background
4.2 Strategy
4.3 The Ryūtō core algorithm
4.4 Improved Multi-sample transcript assembly with Ryūtō
5 Conclusion & Future Work
5.1 Discussion and Outlook
5.2 Summary and Conclusio
Co-Linear Chaining on Pangenome Graphs
Pangenome reference graphs are useful in genomics because they compactly represent the genetic diversity within a species, a capability that linear references lack. However, efficiently aligning sequences to these graphs with complex topology and cycles can be challenging. The seed-chain-extend based alignment algorithms use co-linear chaining as a standard technique to identify a good cluster of exact seed matches that can be combined to form an alignment. Recent works show how the co-linear chaining problem can be efficiently solved for acyclic pangenome graphs by exploiting their small width [Mäkinen et al., TALG'19] and how incorporating gap cost in the scoring function improves alignment accuracy [Chandra and Jain, RECOMB'23]. However, it remains open how to generalize these techniques effectively to general pangenome graphs, which contain cycles. Here we present the first practical formulation and an exact algorithm for co-linear chaining on cyclic pangenome graphs. We rigorously prove the correctness and computational complexity of the proposed algorithm. We evaluate the empirical performance of our algorithm by aligning simulated long reads from the human genome to a cyclic pangenome graph constructed from 95 publicly available haplotype-resolved human genome assemblies. While the existing heuristic-based algorithms are faster, the proposed algorithm provides a significant advantage in terms of accuracy.
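
For readers unfamiliar with co-linear chaining, the sketch below shows the classic quadratic-time dynamic program on a linear reference, which is the problem that the graph algorithms above generalize. It is not the cyclic-graph algorithm from the paper. The anchor format `(query_start, ref_start, length)`, the function name `chain_anchors`, and the gap cost (the difference between the query gap and the reference gap, scaled by `gap_penalty`) are all assumptions made for illustration.

```python
def chain_anchors(anchors, gap_penalty=1.0):
    """Return (score, chain) for the best co-linear chain of exact matches.

    anchors: list of (query_start, ref_start, length) with length > 0.
    An anchor may follow another only if it starts at or after the other's
    end in both the query and the reference.
    """
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return 0, []
    score = [length for _, _, length in anchors]  # chain of one anchor
    back = [None] * n
    for i, (qi, ri, li) in enumerate(anchors):
        for j in range(i):
            qj, rj, lj = anchors[j]
            gap_q = qi - (qj + lj)
            gap_r = ri - (rj + lj)
            if gap_q < 0 or gap_r < 0:
                continue  # overlapping or out of order: not co-linear
            cand = score[j] + li - gap_penalty * abs(gap_q - gap_r)
            if cand > score[i]:
                score[i] = cand
                back[i] = j

    # Trace back from the best-scoring anchor.
    i = max(range(n), key=score.__getitem__)
    best = score[i]
    chain = []
    while i is not None:
        chain.append(anchors[i])
        i = back[i]
    return best, chain[::-1]
```

Sorting by query start guarantees that every valid predecessor of an anchor appears earlier in the list, so a single left-to-right pass suffices.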