Statistical properties of neutral evolution
Neutral evolution is the simplest model of molecular evolution and thus it is
most amenable to a comprehensive theoretical investigation. In this paper, we
characterize the statistical properties of neutral evolution of proteins under
the requirement that the native state remains thermodynamically stable, and
compare them to those of Kimura's model of neutral evolution. Our study is
based on the Structurally Constrained Neutral (SCN) model which we recently
proposed. We show that, in the SCN model, the substitution rate decreases as
longer time intervals are considered, and fluctuates strongly from one branch
of the evolutionary tree to another, leading to non-Poissonian statistics for
the substitution process. Such strong fluctuations are also due to the fact
that neutral substitution rates for individual residues are strongly correlated
for most residue pairs. Interestingly, structurally conserved residues,
characterized by a much below average substitution rate, are also much less
correlated to other residues and evolve in a much more regular way. Our results
could improve methods aimed at distinguishing between neutral and adaptive
substitutions as well as methods for computing the expected number of
substitutions that have occurred since the divergence of two protein sequences.
Comment: 17 pages, 11 figures
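The link between branch-to-branch rate fluctuations and non-Poissonian statistics can be illustrated with a toy simulation. The sketch below is not the SCN model: it simply assumes that each branch draws its own substitution rate from a gamma distribution (a hypothetical choice made for illustration) and compares the variance-to-mean ratio of the substitution counts with that of a constant-rate Poisson process, for which the ratio is close to 1.

```python
import random
import statistics


def count_events(rate, duration, rng):
    """Count events of a Poisson process with the given rate in [0, duration]."""
    elapsed, events = 0.0, 0
    while True:
        # Waiting times between events are exponentially distributed.
        elapsed += rng.expovariate(rate)
        if elapsed > duration:
            return events
        events += 1


def dispersion_index(counts):
    """Variance-to-mean ratio of the counts (about 1 for a Poisson process)."""
    return statistics.pvariance(counts) / statistics.fmean(counts)


def simulate(branches, duration, rate_sampler, seed=0):
    """Draw one rate per branch, then count substitutions on that branch."""
    rng = random.Random(seed)
    return [count_events(rate_sampler(rng), duration, rng) for _ in range(branches)]


# Constant rate on every branch (Kimura-like Poisson behaviour).
constant = simulate(5000, 10.0, lambda rng: 1.0)

# Assumption for illustration only: per-branch rates are gamma distributed
# with mean 1.0 (shape 2.0, scale 0.5); this is not taken from the SCN model.
fluctuating = simulate(5000, 10.0, lambda rng: rng.gammavariate(2.0, 0.5))

print("constant rate:    ", round(dispersion_index(constant), 2))
print("fluctuating rate: ", round(dispersion_index(fluctuating), 2))
```

With these assumed parameters the fluctuating-rate counts are clearly overdispersed, since the variance of the per-branch rate adds to the Poisson variance.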
Prediction of Atomization Energy Using Graph Kernel and Active Learning
Data-driven prediction of molecular properties presents unique challenges to
the design of machine learning methods concerning data
structure/dimensionality, symmetry adaption, and confidence management. In this
paper, we present a kernel-based pipeline that can learn and predict the
atomization energy of molecules with high accuracy. The framework employs
Gaussian process regression to perform predictions based on the similarity
between molecules, which is computed using the marginalized graph kernel. To
apply the marginalized graph kernel, a spatial adjacency rule is first employed
to convert molecules into graphs whose vertices and edges are labeled by
elements and interatomic distances, respectively. We then derive formulas for
the efficient evaluation of the kernel. Specific functional components for the
marginalized graph kernel are proposed, while the effect of the associated
hyperparameters on accuracy and predictive confidence is examined. We show
that the graph kernel is particularly suitable for predicting extensive
properties because its convolutional structure coincides with that of the
covariance formula between sums of random variables. Using an active learning
procedure, we demonstrate that the proposed method can achieve a mean absolute
error of 0.62 ± 0.01 kcal/mol using as few as 2000 training samples on the QM7
data set.
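The remark about extensive properties can be made concrete with a short derivation. The block below is a sketch under an assumption not stated in the abstract: that the energy of a molecule G decomposes into latent per-atom contributions. The symbols \varepsilon_i and the base kernel \kappa are hypothetical names introduced here, and the actual weights of the marginalized graph kernel are not reproduced.

```latex
% Assumption: the energy of molecule G is a sum of latent per-atom terms.
E(G) = \sum_{i \in G} \varepsilon_i
\quad\Longrightarrow\quad
\operatorname{Cov}\bigl(E(G), E(G')\bigr)
  = \sum_{i \in G} \sum_{j \in G'} \operatorname{Cov}(\varepsilon_i, \varepsilon_j)

% A convolutional kernel built from a hypothetical base kernel \kappa
% on pairs of atoms has the same double-sum shape:
K(G, G') = \sum_{i \in G} \sum_{j \in G'} \kappa(i, j)
```

Under this reading, using K as the Gaussian process covariance amounts to placing a prior on the per-atom terms and letting the molecular covariance follow by bilinearity.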
Graph-Based Change-Point Detection
We consider the testing and estimation of change-points -- locations where
the distribution abruptly changes -- in a data sequence. A new approach, based
on scan statistics utilizing graphs representing the similarity between
observations, is proposed. The graph-based approach is non-parametric, and can
be applied to any data set as long as an informative similarity measure on the
sample space can be defined. Accurate analytic approximations to the
significance of graph-based scan statistics for both the single change-point
and the changed interval alternatives are provided. Simulations reveal that the
new approach has better power than existing approaches when the dimension of
the data is moderate to high. The new approach is illustrated on two
applications: the determination of authorship of a classic novel, and the
detection of change in a network over time.
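One way to picture a graph-based scan is to count, for each candidate split point t, the edges of a similarity graph that join an observation at or before t to one after t: few such edges suggest that the two sides are dissimilar. The sketch below is a simplified illustration, not the authors' procedure. It assumes the graph is a minimum spanning tree built from a user-supplied distance, and it compares the count with its expectation under random permutation of the observation order. A full treatment would also standardize by the permutation variance, which is omitted here; all function names are hypothetical.

```python
def minimum_spanning_tree(points, dist):
    """Prim's algorithm; returns the tree as a list of (i, j) index pairs."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known connection cost to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def cross_edge_counts(edges, n):
    """For t = 1..n-1, count edges joining the first t observations to the rest."""
    return {t: sum(1 for i, j in edges if min(i, j) < t <= max(i, j))
            for t in range(1, n)}


def expected_cross_edges(num_edges, n, t):
    """Expected count when the observation order is a uniformly random permutation."""
    return num_edges * 2.0 * t * (n - t) / (n * (n - 1))


def estimate_change_point(points, dist):
    """Return the split t with the largest deficit of cross edges (unstandardized)."""
    n = len(points)
    edges = minimum_spanning_tree(points, dist)
    counts = cross_edge_counts(edges, n)
    return max(counts, key=lambda t: expected_cross_edges(len(edges), n, t) - counts[t])


# Toy sequence: 20 scrambled values near 0 followed by 20 scrambled values near 100.
data = [(7 * i) % 20 for i in range(20)] + [100 + (7 * i) % 20 for i in range(20)]
distance = lambda a, b: abs(a - b)
print(estimate_change_point(data, distance))
```

Because only a distance function is required, the same sketch applies to any data type for which a meaningful dissimilarity can be defined, which is the property the abstract emphasizes.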