The EM Algorithm and the Rise of Computational Biology
In the past decade computational biology has grown from a cottage industry
with a handful of researchers to an attractive interdisciplinary field,
catching the attention and imagination of many quantitatively-minded
scientists. Of interest to us is the key role played by the EM algorithm during
this transformation. We survey the use of the EM algorithm in a few important
computational biology problems surrounding the "central dogma" of molecular
biology: from DNA to RNA and then to proteins. Topics of this article include
sequence motif discovery, protein sequence alignment, population genetics,
evolutionary models and mRNA expression microarray data analysis. Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
An Efficient Alignment Algorithm for Searching Simple Pseudoknots over Long Genomic Sequence
A list of parameterized problems in bioinformatics
In this report we present a list of problems that originated in bioinformatics. Our aim is to collect information on such problems that have been analyzed from the point of view of Parameterized Complexity. For every problem we give its definition and biological motivation together with known complexity results.
A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation
Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty ("Forward" scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores ("Viterbi" scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.
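As a rough illustration of the conjecture above, the following sketch computes a Gumbel tail probability and an E-value for a bit score using λ = log 2. The location parameter `mu`, the number of comparisons `n_targets`, and the function names are assumptions made for this example; the abstract does not specify them.

```python
import math

LAMBDA = math.log(2)  # slope conjectured in the abstract for bit scores


def gumbel_pvalue(score, mu, lam=LAMBDA):
    """P(S >= score) for a Gumbel distribution with location mu and slope lam.

    mu is a hypothetical, model-specific location parameter.
    """
    # -expm1(-x) evaluates 1 - exp(-x) accurately when x is tiny
    return -math.expm1(-math.exp(-lam * (score - mu)))


def evalue(score, mu, n_targets, lam=LAMBDA):
    """Expected number of hits scoring >= score over n_targets comparisons."""
    return n_targets * gumbel_pvalue(score, mu, lam)
```

With λ = log 2, the tail probability far above `mu` is approximately 2 raised to the power `-(score - mu)`, so each additional bit of score roughly halves the P-value.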
Exploiting bounded signal flow for graph orientation based on cause-effect pairs
Background: We consider the following problem: Given an undirected network and a set of sender-receiver pairs, direct all edges such that the maximum number of "signal flows" defined by the pairs can be routed respecting edge directions. This problem has applications in understanding protein-interaction-based cell regulation mechanisms. Since this problem is NP-hard, research so far has concentrated on polynomial-time approximation algorithms and tractable special cases. Results: We take the viewpoint of parameterized algorithmics and examine several parameters related to the maximum signal flow over vertices or edges. We provide several fixed-parameter tractability results, and in one case a sharp complexity dichotomy between a linear-time solvable case and a slightly more general NP-hard case. We examine the value of these parameters for several real-world network instances. Conclusions: Several biologically relevant special cases of the NP-hard problem can be solved to optimality. In this way, parameterized analysis yields both deeper insight into the computational complexity and practical solving strategies. Background Current technologies [1] like two-hybrid screening ca
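To make the problem statement above concrete, here is a minimal brute-force sketch that tries every orientation of a small undirected graph and counts how many sender-receiver pairs remain connected by a directed path. It is an exponential-time illustration of the problem definition only, not one of the parameterized algorithms from the paper; the function names and the edge-list input format are assumptions.

```python
from collections import deque
from itertools import product


def reachable(adj, src, dst):
    """Breadth-first search: is dst reachable from src along directed arcs?"""
    seen = {src}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False


def best_orientation(edges, pairs):
    """Try all 2^|E| orientations; return (max satisfied pairs, arcs)."""
    best_count, best_arcs = -1, None
    for flips in product((False, True), repeat=len(edges)):
        # Each edge (u, v) is directed u->v, or v->u when its flip bit is set
        arcs = [(v, u) if f else (u, v) for (u, v), f in zip(edges, flips)]
        adj = {}
        for u, v in arcs:
            adj.setdefault(u, []).append(v)
        count = sum(reachable(adj, s, t) for s, t in pairs)
        if count > best_count:
            best_count, best_arcs = count, arcs
    return best_count, best_arcs
```

On a path a-b-c, the pairs (a, c) and (c, a) cannot both be satisfied, whereas on a triangle all three pairs (a, b), (b, c), (c, a) can be routed by orienting the edges as a directed cycle.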
Dynamic-Backbone Protein-Ligand Structure Prediction with Multiscale Generative Diffusion Models
Molecular complexes formed by proteins and small-molecule ligands are
ubiquitous, and predicting their 3D structures can facilitate both biological
discoveries and the design of novel enzymes or drug molecules. Here we propose
NeuralPLexer, a deep generative model framework to rapidly predict
protein-ligand complex structures and their fluctuations using protein backbone
template and molecular graph inputs. NeuralPLexer jointly samples protein and
small-molecule 3D coordinates at an atomistic resolution through a generative
model that incorporates biophysical constraints and inferred proximity
information into a time-truncated diffusion process. The reverse-time
generative diffusion process is learned by a novel stereochemistry-aware
equivariant graph transformer that enables efficient, concurrent gradient field
prediction for all heavy atoms in the protein-ligand complex. NeuralPLexer
outperforms existing physics-based and learning-based methods on benchmarking
problems including fixed-backbone blind protein-ligand docking and
ligand-coupled binding site repacking. Moreover, we identify preliminary
evidence that NeuralPLexer enriches bound-state-like protein structures when
applied to systems where protein folding landscapes are significantly altered
by the presence of ligands. Our results reveal that a data-driven approach can
capture the structural cooperativity among protein and small-molecule entities,
showing promise for the computational identification of novel drug targets and
the end-to-end differentiable design of functional small molecules and
ligand-binding proteins.
Supervised Detection of Conserved Motifs in DNA Sequences with cosmo
A number of computational methods have been proposed for identifying transcription factor binding sites from a set of unaligned sequences that are thought to share the motif in question. We here introduce an algorithm, called cosmo, that allows this search to be supervised by specifying a set of constraints that the position weight matrix of the unknown motif must satisfy. Such constraints may be formulated, for example, on the basis of prior knowledge about the structure of the transcription factor in question. The algorithm is based on the same two-component multinomial mixture model used by MEME, with stronger reliance, however, on the likelihood principle instead of more ad-hoc criteria like the E-value. The intensity parameter in the ZOOPS and TCM models, for instance, is estimated based on a profile-likelihood approach, and the width of the unknown motif is selected based on BIC. These changes allow cosmo to outperform MEME even in the absence of any constraints, as evidenced by 2- to 3-fold greater sensitivity in some simulation studies. Additional improvements in performance can be achieved by selecting the model type (OOPS, ZOOPS, or TCM) data-adaptively or by supplying correctly specified constraints, especially if the motif appears only as a weak signal in the data. The algorithm can data-adaptively choose between working in a given constrained model or in the completely unconstrained model, guarding against the risk of supplying mis-specified constraints. Simulation studies suggest that this approach can offer 3 to 3.5 times greater sensitivity than MEME. The algorithm has been implemented in the form of a stand-alone C program as well as a web application that can be accessed at http://cosmoweb.berkeley.edu. An R package is available through Bioconductor (http://bioconductor.org)
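The abstract states that the motif width is selected based on BIC. A minimal sketch of that kind of selection is given below, assuming the standard definition BIC = -2 log L + k log n. The log-likelihood values, the parameter-count function, and all names here are hypothetical placeholders, not cosmo's actual interface or results.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Standard Bayesian information criterion; lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def select_width(loglik_by_width, n_obs, n_params_for_width):
    """Return the candidate width with the smallest BIC, plus all BIC values.

    loglik_by_width: dict mapping candidate width -> maximized log-likelihood
    n_params_for_width: hypothetical function giving the number of free
        parameters of the model fitted at a given width
    """
    scores = {
        w: bic(ll, n_params_for_width(w), n_obs)
        for w, ll in loglik_by_width.items()
    }
    return min(scores, key=scores.get), scores
```

For example, a caller might assume three free parameters per position of a DNA position weight matrix plus a few extra for the background and intensity, and pass `lambda w: 3 * w + 4`; that count is an assumption for illustration only.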