Global Alignment of Molecular Sequences via Ancestral State Reconstruction
Molecular phylogenetic techniques do not generally account for such common
evolutionary events as site insertions and deletions (known as indels). Instead,
tree-building algorithms and ancestral state inference procedures typically
rely on substitution-only models of sequence evolution. In practice, these
methods are extended beyond this simplified setting with the use of heuristics
that produce global alignments of the input sequences, an important problem
that has no rigorous model-based solution. In this paper we consider a new
version of the multiple sequence alignment problem in the context of stochastic
indel models. More precisely, we introduce the following trace reconstruction
problem on a tree (TRPT): a binary sequence is broadcast through a tree
channel where we allow substitutions, deletions, and insertions; we seek to
reconstruct the original sequence from the sequences received at the leaves of
the tree. We give a recursive procedure for this problem with strong
reconstruction guarantees at low mutation rates, providing also an alignment of
the sequences at the leaves of the tree. The TRPT without indels has
been studied in previous work (Mossel 2004, Daskalakis et al. 2006) as a
bootstrapping step towards obtaining optimal phylogenetic reconstruction
methods. The present work sets up a framework for extending these results to
evolutionary models with indels.
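To make the tree channel in this abstract concrete, the following is a minimal simulation sketch of the forward process only: a binary sequence is copied down a complete binary tree, and each edge applies independent per-site substitutions, deletions and insertions. It does not implement the paper's recursive reconstruction procedure. The function names, the complete-binary-tree shape, and the exact indel mechanics (one possible random bit inserted before each site) are assumptions made for illustration.

```python
import random

def mutate(seq, p_sub, p_del, p_ins, rng):
    """Pass a binary sequence through one edge of the (assumed) channel."""
    out = []
    for bit in seq:
        if rng.random() < p_ins:
            out.append(rng.randint(0, 1))  # insert a random bit before this site
        if rng.random() < p_del:
            continue                       # delete this site
        if rng.random() < p_sub:
            bit ^= 1                       # substitute (flip) the bit
        out.append(bit)
    return out

def broadcast(root_seq, depth, p_sub, p_del, p_ins, rng):
    """Broadcast root_seq down a complete binary tree; return the leaf sequences."""
    level = [list(root_seq)]
    for _ in range(depth):
        # every node sends an independently mutated copy to each of its two children
        level = [mutate(s, p_sub, p_del, p_ins, rng)
                 for s in level for _child in range(2)]
    return level
```

Calling `broadcast(root, 3, 0.01, 0.01, 0.01, random.Random(0))` returns eight leaf sequences of possibly different lengths; recovering `root` from them is the reconstruction task the abstract describes.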
Multiscale likelihood analysis and complexity penalized estimation
We describe here a framework for a certain class of multiscale likelihood
factorizations wherein, in analogy to a wavelet decomposition of an L^2
function, a given likelihood function has an alternative representation as a
product of conditional densities reflecting information in both the data and
the parameter vector localized in position and scale. The framework is
developed as a set of sufficient conditions for the existence of such
factorizations, formulated in analogy to those underlying a standard
multiresolution analysis for wavelets, and hence can be viewed as a
multiresolution analysis for likelihoods. We then consider the use of these
factorizations in the task of nonparametric, complexity penalized likelihood
estimation. We study the risk properties of certain thresholding and
partitioning estimators, and demonstrate their adaptivity and near-optimality,
in a minimax sense over a broad range of function spaces, based on squared
Hellinger distance as a loss function. In particular, our results provide an
illustration of how properties of classical wavelet-based estimators can be
obtained in a single, unified framework that includes models for continuous,
count and categorical data types.
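As an illustration of what such a factorization looks like, consider the well-known count-data case of independent Poisson observations on a dyadic partition. The notation below is illustrative and is not taken from the paper. Let $X_{J,k} \sim \mathrm{Poisson}(\lambda_{J,k})$ for $k = 0, \dots, 2^J - 1$, and define coarser-scale sums recursively by $X_{j,k} = X_{j+1,2k} + X_{j+1,2k+1}$ and $\lambda_{j,k} = \lambda_{j+1,2k} + \lambda_{j+1,2k+1}$. The likelihood then factorizes across scales:

```latex
\Pr(X \mid \lambda)
  = \mathrm{Poisson}\!\left(X_{0,0} \mid \lambda_{0,0}\right)
    \prod_{j=0}^{J-1} \prod_{k=0}^{2^{j}-1}
    \mathrm{Binomial}\!\left(X_{j+1,2k} \mid X_{j,k},\, \rho_{j,k}\right),
\qquad
\rho_{j,k} = \frac{\lambda_{j+1,2k}}{\lambda_{j,k}} .
```

Each binomial factor depends only on how a parent count splits between its two children, so the parameters $\rho_{j,k}$ play a role analogous to wavelet coefficients localized in position $k$ and scale $j$.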
A Knowledge Gradient Policy for Sequencing Experiments to Identify the Structure of RNA Molecules Using a Sparse Additive Belief Model
We present a sparse knowledge gradient (SpKG) algorithm for adaptively
selecting the targeted regions within a large RNA molecule to identify which
regions are most amenable to interactions with other molecules. Experimentally,
such regions can be inferred from fluorescence measurements obtained by binding
a complementary probe with fluorescence markers to the targeted regions. We use
a biophysical model which shows that the fluorescence ratio on the log scale
has a sparse linear relationship with the coefficients describing the
accessibility of each nucleotide, since not all sites are accessible (due to
the folding of the molecule). The SpKG algorithm uniquely combines the Bayesian
ranking and selection problem with the frequentist regularized
regression approach known as the Lasso. We use this algorithm to identify the
sparsity pattern of the linear model and to decide sequentially which regions to
test before the experimental budget is exhausted. We also develop two
further algorithms: a batch SpKG algorithm, which sequentially generates several
suggestions so that experiments can be run in parallel; and batch SpKG with a
procedure that we call length mutagenesis, which dynamically adds new
alternatives in the form of new types of probes created by inserting, deleting
or mutating nucleotides within existing probes. In simulation, we demonstrate these
algorithms on the Group I intron (a mid-size RNA molecule), showing that they
efficiently learn the correct sparsity pattern, identify the most accessible
region, and outperform several other policies.
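The sparsity-recovery ingredient mentioned in this abstract can be sketched with a plain Lasso fit. The code below is a minimal pure-Python cyclic coordinate descent for the objective (1/2n)·||y − Xb||² + lam·||b||₁. It is not the SpKG policy: the Bayesian knowledge gradient step is omitted entirely. The function names, and the reading of each row of `X` as a hypothetical probe and `y` as a log fluorescence ratio, are assumptions made for illustration.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t (the proximal operator of the L1 penalty)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residual y - X b, with b initialised to zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # correlation of column j with the residual that excludes coordinate j
            rho = sum(X[i][j] * (resid[i] + X[i][j] * b[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new
    return b

def support(b):
    """Indices of the nonzero coefficients, i.e. the estimated sparsity pattern."""
    return [j for j, v in enumerate(b) if v != 0.0]
```

In use, `support(lasso_cd(X, y, lam))` gives the coefficients that the fit keeps; larger values of `lam` drive more coefficients exactly to zero, which is what makes the Lasso suitable for identifying a sparsity pattern.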