Thermodynamic Analysis of Interacting Nucleic Acid Strands
Motivated by the analysis of natural and engineered DNA and RNA systems, we present the first algorithm for calculating the partition function of an unpseudoknotted complex of multiple interacting nucleic acid strands. This dynamic program is based on a rigorous extension of secondary structure models to the multistranded case, addressing representation and distinguishability issues that do not arise for single-stranded structures. We then derive the form of the partition function for a fixed volume containing a dilute solution of nucleic acid complexes. This expression can be evaluated explicitly for small numbers of strands, allowing the calculation of the equilibrium population distribution for each species of complex. Alternatively, for large systems (e.g., a test tube), we show that the unique complex concentrations corresponding to thermodynamic equilibrium can be obtained by solving a convex programming problem. Partition function and concentration information can then be used to calculate equilibrium base-pairing observables. The underlying physics and mathematical formulation of these problems lead to an interesting blend of approaches, including ideas from graph theory, group theory, dynamic programming, combinatorics, convex optimization, and Lagrange duality.
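As a rough illustration of what "equilibrium complex concentrations" means, the sketch below treats the simplest possible case: two strands A and B that can form a single dimer AB. It is not the paper's algorithm; it assumes the association constant `K` is already known (in the paper's setting such constants would come from partition functions) and solves the mass-action condition directly as a quadratic. The function name, parameter names, and numerical values are hypothetical.

```python
import math

def dimer_equilibrium(a0, b0, K):
    """Equilibrium concentrations for the toy reaction A + B <-> AB.

    a0, b0: total concentrations of strands A and B (hypothetical values)
    K:      association constant, assumed given
    Returns (free A, free B, dimer AB).
    """
    # Mass action K * (a0 - x) * (b0 - x) = x, with x the dimer concentration,
    # rearranges to K x^2 - s x + K a0 b0 = 0 where s = K (a0 + b0) + 1.
    s = K * (a0 + b0) + 1.0
    disc = s * s - 4.0 * K * K * a0 * b0  # always positive for K, a0, b0 > 0
    # Smaller root, written in a form that avoids subtractive cancellation.
    x = 2.0 * K * a0 * b0 / (s + math.sqrt(disc))
    return a0 - x, b0 - x, x

free_a, free_b, dimer = dimer_equilibrium(1e-6, 2e-6, 1e7)
print(free_a, free_b, dimer)
```

With many strands and many possible complexes no closed form like this is available, which is where a convex programming formulation of the kind described in the abstract becomes useful.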
Sparse logistic principal components analysis for binary data
We develop a new principal components analysis (PCA) type dimension reduction
method for binary data. Different from the standard PCA which is defined on the
observed data, the proposed PCA is defined on the logit transform of the
success probabilities of the binary observations. Sparsity is introduced to the
principal component (PC) loading vectors for enhanced interpretability and more
stable extraction of the principal components. Our sparse PCA is formulated as
solving an optimization problem with a criterion function motivated from a
penalized Bernoulli likelihood. A Majorization--Minimization algorithm is
developed to efficiently solve the optimization problem. The effectiveness of
the proposed sparse logistic PCA method is illustrated by application to a
single nucleotide polymorphism data set and a simulation study.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/10-AOAS327
Multiscale Bernstein polynomials for densities
Our focus is on constructing a multiscale nonparametric prior for densities.
The Bayes density estimation literature is dominated by single scale methods,
with the exception of Polya trees, which favor overly-spiky densities even when
the truth is smooth. We propose a multiscale Bernstein polynomial family of
priors, which produce smooth realizations that do not rely on hard partitioning
of the support. At each level in an infinitely-deep binary tree, we place a
beta dictionary density; within a scale the densities are equivalent to
Bernstein polynomials. Using a stick-breaking characterization, stochastically
decreasing weights are allocated to the finer scale dictionary elements. A
slice sampler is used for posterior computation, and properties are described.
The method characterizes densities with locally-varying smoothness, and can
produce a sequence of coarse to fine density estimates. An extension for
Bayesian testing of group differences is introduced and applied to DNA
methylation array data.
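To make the link between beta dictionary densities and Bernstein polynomials concrete, here is a minimal single-scale sketch. It assumes the standard Bernstein polynomial density of order k, a mixture of Beta(j, k - j + 1) densities for j = 1..k; the multiscale tree, stick-breaking weights, and slice sampler from the abstract are not reproduced. The function names and the example weights are hypothetical.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at a point x strictly inside (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def bernstein_density(x, weights):
    """Mixture of Beta(j, k - j + 1) densities, j = 1..k, with k = len(weights).

    weights are assumed non-negative and summing to one.
    """
    k = len(weights)
    return sum(w * beta_pdf(x, j, k - j + 1)
               for j, w in enumerate(weights, start=1))

# Hypothetical weights that put more mass near 1.
weights = [0.1, 0.2, 0.3, 0.4]
print([round(bernstein_density(x, weights), 3) for x in (0.1, 0.5, 0.9)])
```

Equal weights recover the uniform density on (0, 1), because the Bernstein basis polynomials of a given degree sum to one; unequal weights tilt the density smoothly without any hard partition of the support.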
Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs
Laplacian mixture models identify overlapping regions of influence in
unlabeled graph and network data in a scalable and computationally efficient
way, yielding useful low-dimensional representations. By combining Laplacian
eigenspace and finite mixture modeling methods, they provide probabilistic or
fuzzy dimensionality reductions or domain decompositions for a variety of input
data types, including mixture distributions, feature vectors, and graphs or
networks. Provably optimal recovery by the algorithm is shown analytically
for a nontrivial class of cluster graphs. Heuristic approximations for scalable
high-performance implementations are described and empirically tested.
Connections to PageRank and community detection in network analysis demonstrate
the wide applicability of this approach. The origins of fuzzy spectral methods,
beginning with generalized heat or diffusion equations in physics, are reviewed
and summarized. Comparisons to other dimensionality reduction and clustering
methods for challenging unsupervised machine learning problems are also
discussed.
Comment: 13 figures, 35 references
Binding Site Graphs: A New Graph Theoretical Framework for Prediction of Transcription Factor Binding Sites
Computational prediction of nucleotide binding specificity for transcription factors remains a fundamental and largely unsolved problem. Determination of binding positions is a prerequisite for research in gene regulation, a major mechanism controlling phenotypic diversity. Furthermore, an accurate determination of binding specificities from high-throughput data sources is necessary to realize the full potential of systems biology. Unfortunately, a recent independent evaluation showed that more than half the predictions from the most widely used algorithms are false. We introduce a graph-theoretical framework to describe local sequence similarity as the pair-wise distances between nucleotides in promoter sequences, and hypothesize that densely connected subgraphs are indicative of transcription factor binding sites. Using a well-established sampling algorithm coupled with simple clustering and scoring schemes, we identify sets of closely related nucleotides and test those for known TF binding activity. Using an independent benchmark, we find our algorithm predicts yeast binding motifs considerably better than currently available techniques and without manual curation. Importantly, we reduce the number of false positive predictions in yeast to less than 30%. We also develop a framework to evaluate the statistical significance of our motif predictions. We show that our approach is robust to the choice of input promoters, and thus can be used in the context of predicting binding positions from noisy experimental data. We apply our method to identify binding sites using data from genome scale ChIP–chip experiments. Results from these experiments are publicly available at http://cagt10.bu.edu/BSG. The graphical framework developed here may be useful when combining predictions from numerous computational and experimental measures.
Finally, we discuss how our algorithm can be used to improve the sensitivity of computational predictions of transcription factor binding specificities.