6,431 research outputs found
Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization
Background: The discovery of functional non-coding RNA sequences has led to an increasing interest in algorithms related to RNA analysis. Traditional sequence alignment algorithms, however, fail at computing reliable alignments of low-homology RNA sequences. The spatial conformation of RNA sequences largely determines their function, and therefore RNA alignment algorithms have to take structural information into account. Results: We present a graph-based representation for sequence-structure alignments, which we model as an integer linear program (ILP). We sketch how we compute an optimal or near-optimal solution to the ILP using methods from combinatorial optimization, and present results on a recently published benchmark set for RNA alignments. Conclusions: The implementation of our algorithm yields better alignments in terms of two published scores than the other programs that we tested: This is especially the case with an increasing number of inpu
Geometric combinatorics and computational molecular biology: branching polytopes for RNA sequences
Questions in computational molecular biology generate various discrete
optimization problems, such as DNA sequence alignment and RNA secondary
structure prediction. However, the optimal solutions are fundamentally
dependent on the parameters used in the objective functions. The goal of a
parametric analysis is to elucidate such dependencies, especially as they
pertain to the accuracy and robustness of the optimal solutions. Techniques
from geometric combinatorics, including polytopes and their normal fans, have
been used previously to give parametric analyses of simple models for DNA
sequence alignment and RNA branching configurations. Here, we present a new
computational framework, and proof-of-principle results, which give the first
complete parametric analysis of the branching portion of the nearest neighbor
thermodynamic model for secondary structure prediction for real RNA sequences.Comment: 17 pages, 8 figure
A Haystack Heuristic for Autoimmune Disease Biomarker Discovery Using Next-Gen Immune Repertoire Sequencing Data.
Large-scale DNA sequencing of immunological repertoires offers an opportunity for the discovery of novel biomarkers for autoimmune disease. Available bioinformatics techniques however, are not adequately suited for elucidating possible biomarker candidates from within large immunosequencing datasets due to unsatisfactory scalability and sensitivity. Here, we present the Haystack Heuristic, an algorithm customized to computationally extract disease-associated motifs from next-generation-sequenced repertoires by contrasting disease and healthy subjects. This technique employs a local-search graph-theory approach to discover novel motifs in patient data. We apply the Haystack Heuristic to nine million B-cell receptor sequences obtained from nearly 100 individuals in order to elucidate a new motif that is significantly associated with multiple sclerosis. Our results demonstrate the effectiveness of the Haystack Heuristic in computing possible biomarker candidates from high throughput sequencing data and could be generalized to other datasets
RNA secondary structure prediction from multi-aligned sequences
It has been well accepted that the RNA secondary structures of most
functional non-coding RNAs (ncRNAs) are closely related to their functions and
are conserved during evolution. Hence, prediction of conserved secondary
structures from evolutionarily related sequences is one important task in RNA
bioinformatics; the methods are useful not only to further functional analyses
of ncRNAs but also to improve the accuracy of secondary structure predictions
and to find novel functional RNAs from the genome. In this review, I focus on
common secondary structure prediction from a given aligned RNA sequence, in
which one secondary structure whose length is equal to that of the input
alignment is predicted. I systematically review and classify existing tools and
algorithms for the problem, by utilizing the information employed in the tools
and by adopting a unified viewpoint based on maximum expected gain (MEG)
estimators. I believe that this classification will allow a deeper
understanding of each tool and provide users with useful information for
selecting tools for common secondary structure predictions.Comment: A preprint of an invited review manuscript that will be published in
a chapter of the book `Methods in Molecular Biology'. Note that this version
of the manuscript may differ from the published versio
Viral population estimation using pyrosequencing
The diversity of virus populations within single infected hosts presents a
major difficulty for the natural immune response as well as for vaccine design
and antiviral drug therapy. Recently developed pyrophosphate based sequencing
technologies (pyrosequencing) can be used for quantifying this diversity by
ultra-deep sequencing of virus samples. We present computational methods for
the analysis of such sequence data and apply these techniques to pyrosequencing
data obtained from HIV populations within patients harboring drug resistant
virus strains. Our main result is the estimation of the population structure of
the sample from the pyrosequencing reads. This inference is based on a
statistical approach to error correction, followed by a combinatorial algorithm
for constructing a minimal set of haplotypes that explain the data. Using this
set of explaining haplotypes, we apply a statistical model to infer the
frequencies of the haplotypes in the population via an EM algorithm. We
demonstrate that pyrosequencing reads allow for effective population
reconstruction by extensive simulations and by comparison to 165 sequences
obtained directly from clonal sequencing of four independent, diverse HIV
populations. Thus, pyrosequencing can be used for cost-effective estimation of
the structure of virus populations, promising new insights into viral
evolutionary dynamics and disease control strategies.Comment: 23 pages, 13 figure
The Mathematics of Phylogenomics
The grand challenges in biology today are being shaped by powerful
high-throughput technologies that have revealed the genomes of many organisms,
global expression patterns of genes and detailed information about variation
within populations. We are therefore able to ask, for the first time,
fundamental questions about the evolution of genomes, the structure of genes
and their regulation, and the connections between genotypes and phenotypes of
individuals. The answers to these questions are all predicated on progress in a
variety of computational, statistical, and mathematical fields.
The rapid growth in the characterization of genomes has led to the
advancement of a new discipline called Phylogenomics. This discipline results
from the combination of two major fields in the life sciences: Genomics, i.e.,
the study of the function and structure of genes and genomes; and Molecular
Phylogenetics, i.e., the study of the hierarchical evolutionary relationships
among organisms and their genomes. The objective of this article is to offer
mathematicians a first introduction to this emerging field, and to discuss
specific mathematical problems and developments arising from phylogenomics.Comment: 41 pages, 4 figure
An exact mathematical programming approach to multiple RNA sequence-structure alignment
One of the main tasks in computational biology is the computation of
alignments of genomic sequences to reveal their commonalities. In case of DNA
or protein sequences, sequence information alone is usually sufficient to
compute reliable alignments. RNA molecules, however, build spatial
conformations—the secondary structure—that are more conserved than the actual
sequence. Hence, computing reliable alignments of RNA molecules has to take
into account the secondary structure. We present a novel framework for the
computation of exact multiple sequence-structure alignments: We give a graph-
theoretic representation of the sequence-structure alignment problem and
phrase it as an integer linear program. We identify a class of constraints
that make the problem easier to solve and relax the original integer linear
program in a Lagrangian manner. Experiments on a recently published benchmark
show that our algorithms has a comparable performance than more costly dynamic
programming algorithms, and outperforms all other approaches in terms of
solution quality with an increasing number of input sequences
- …