A statistical physics perspective on alignment-independent protein sequence comparison.
Motivation: Within bioinformatics, the textual alignment of amino acid sequences has long dominated the determination of similarity between proteins, with all that implies for shared structure, function, and evolutionary descent. Despite the relative success of modern-day sequence alignment algorithms, so-called alignment-free approaches offer a complementary means of determining and expressing similarity, with potential benefits in certain key applications, such as regression analysis of protein structure-function studies, where alignment-based similarity has performed poorly. Results: Here, we offer a fresh, statistical physics-based perspective focusing on the question of alignment-free comparison, in the process adapting results from “first passage probability distribution” to summarize statistics of ensemble averaged amino acid propensity values. In this paper, we introduce and elaborate this approach
Equi-energy sampler with applications in statistical inference and statistical mechanics
We introduce a new sampling algorithm, the equi-energy sampler, for efficient
statistical sampling and estimation. Complementary to the widely used
temperature-domain methods, the equi-energy sampler, utilizing the
temperature--energy duality, targets the energy directly. The focus on the
energy function not only facilitates efficient sampling, but also provides a
powerful means for statistical estimation, for example, the calculation of the
density of states and microcanonical averages in statistical mechanics. The
equi-energy sampler is applied to a variety of problems, including exponential
regression in statistics, motif sampling in computational biology and protein
folding in biophysics.
Comment: This paper discussed in: [math.ST/0611217], [math.ST/0611219],
[math.ST/0611221], [math.ST/0611222]. Rejoinder in [math.ST/0611224].
Published at http://dx.doi.org/10.1214/009053606000000515 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
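The abstract's remark that the density of states gives direct access to thermal averages can be illustrated on a toy system. The sketch below is not the equi-energy sampler itself: it obtains the density of states by brute-force enumeration of a short open Ising chain (an assumed example system, with hypothetical function names) and then reweights it to get the canonical mean energy at any temperature, which is the step a sampled density-of-states estimate would feed into.

```python
import itertools
import math
from collections import Counter


def ising_energy(spins):
    """Energy of an open 1D Ising chain with unit ferromagnetic coupling."""
    return -sum(a * b for a, b in zip(spins, spins[1:]))


def density_of_states(n_spins):
    """Count configurations per energy level by exhaustive enumeration."""
    configurations = itertools.product((-1, 1), repeat=n_spins)
    return Counter(ising_energy(s) for s in configurations)


def canonical_mean_energy(dos, temperature):
    """Reweight the density of states g(E) with Boltzmann factors exp(-E/T)."""
    weights = {e: g * math.exp(-e / temperature) for e, g in dos.items()}
    partition = sum(weights.values())
    return sum(e * w for e, w in weights.items()) / partition


if __name__ == "__main__":
    dos = density_of_states(6)
    for temperature in (0.5, 1.0, 2.0, 4.0):
        print(temperature, canonical_mean_energy(dos, temperature))
```

Once the density of states is known, one table serves every temperature; for this open chain the result can be checked against the closed form -(n - 1) tanh(1/T).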
Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models
Spatially proximate amino acids in a protein tend to coevolve. A protein's
three-dimensional (3D) structure hence leaves an echo of correlations in the
evolutionary record. Reverse engineering 3D structures from such correlations
is an open problem in structural biology, pursued with increasing vigor as more
and more protein sequences continue to fill the data banks. Within this task
lies a statistical inference problem, rooted in the following: correlation
between two sites in a protein sequence can arise from firsthand interaction
but can also be network-propagated via intermediate sites; observed correlation
is not enough to guarantee proximity. To separate direct from indirect
interactions is an instance of the general problem of inverse statistical
mechanics, where the task is to learn model parameters (fields, couplings) from
observables (magnetizations, correlations, samples) in large systems. In the
context of protein sequences, the approach has been referred to as
direct-coupling analysis. Here we show that the pseudolikelihood method,
applied to 21-state Potts models describing the statistical properties of
families of evolutionarily related proteins, significantly outperforms existing
approaches to the direct-coupling analysis, the latter being based on standard
mean-field techniques. This improved performance also relies on a modified
score for the coupling strength. The results are verified using known crystal
structures of specific sequence instances of various protein families. Code
implementing the new method can be found at http://plmdca.csc.kth.se/.
Comment: 19 pages, 16 figures, published version
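As a rough illustration of the pseudolikelihood idea named in the abstract, the sketch below evaluates the negative log-pseudolikelihood of a q-state Potts model: for every site, the log-probability of the observed state conditioned on all other sites. It is a minimal sketch, not the authors' code: the nested-list layout `J[r][i][a][b]`, the function names, and the absence of regularization, sequence reweighting, and any optimizer are all assumptions made to keep the example short.

```python
import math


def zero_parameters(n_sites, q):
    """Return all-zero fields h[r][a] and couplings J[r][i][a][b] (hypothetical layout)."""
    h = [[0.0] * q for _ in range(n_sites)]
    J = [[[[0.0] * q for _ in range(q)] for _ in range(n_sites)]
         for _ in range(n_sites)]
    return h, J


def neg_log_pseudolikelihood(seqs, h, J):
    """Average over sequences of -sum_r log P(s_r | all other sites)."""
    n_sites = len(h)
    q = len(h[0])
    total = 0.0
    for seq in seqs:
        for r in range(n_sites):
            # Conditional log-weights of each candidate state a at site r.
            logits = []
            for a in range(q):
                value = h[r][a]
                for i in range(n_sites):
                    if i != r:
                        value += J[r][i][a][seq[i]]
                logits.append(value)
            # Numerically stable log of the local normalization constant.
            top = max(logits)
            log_z = top + math.log(sum(math.exp(x - top) for x in logits))
            total -= logits[seq[r]] - log_z
    return total / len(seqs)


if __name__ == "__main__":
    h, J = zero_parameters(n_sites=3, q=21)
    # With all parameters zero every state is equally likely: 3 * log(21).
    print(neg_log_pseudolikelihood([[0, 5, 20], [1, 1, 1]], h, J))
```

Each site's normalization runs over only q states rather than q^N sequences, which is what makes this objective tractable to minimize over fields and couplings.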
Identification of direct residue contacts in protein-protein interaction by message passing
Understanding the molecular determinants of specificity in protein-protein
interaction is an outstanding challenge of postgenome biology. The availability
of large protein databases generated from sequences of hundreds of bacterial
genomes enables various statistical approaches to this problem. In this context
covariance-based methods have been used to identify correlation between amino
acid positions in interacting proteins. However, these methods have an
important shortcoming, in that they cannot distinguish between directly and
indirectly correlated residues. We developed a method that combines covariance
analysis with global inference analysis, adopted from use in statistical
physics. Applied to a set of >2,500 representatives of the bacterial
two-component signal transduction system, the combination of covariance with
global inference successfully and robustly identified residue pairs that are
proximal in space without resorting to ad hoc tuning parameters, both for
heterointeractions between sensor kinase (SK) and response regulator (RR)
proteins and for homointeractions between RR proteins. The spectacular success
of this approach illustrates the effectiveness of the global inference approach
in identifying direct interaction based on sequence information alone. We
expect this method to be applicable soon to interaction surfaces between
proteins present in only 1 copy per genome as the number of sequenced genomes
continues to expand. Use of this method could significantly increase the
potential targets for therapeutic intervention, shed light on the mechanism of
protein-protein interaction, and establish the foundation for the accurate
prediction of interacting protein partners.
Comment: Supplementary information available on
http://www.pnas.org/content/106/1/67.abstrac
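The shortcoming of covariance-based methods described above can be seen in a toy model. The sketch below is an illustration only, using binary spins in place of amino acids and hypothetical names: it enumerates a three-site chain in which sites 0 and 2 share no direct coupling and computes the exact pair correlations. With no fields the site means vanish, so the correlation equals the covariance.

```python
import itertools
import math


def pair_correlations(couplings, n_sites):
    """Exact <s_i s_j> for an Ising model given as {(i, j): coupling}."""
    states = list(itertools.product((-1, 1), repeat=n_sites))
    weights = [
        math.exp(sum(c * s[i] * s[j] for (i, j), c in couplings.items()))
        for s in states
    ]
    partition = sum(weights)
    return {
        (i, j): sum(w * s[i] * s[j] for s, w in zip(states, weights)) / partition
        for i, j in itertools.combinations(range(n_sites), 2)
    }


if __name__ == "__main__":
    # Chain 0 - 1 - 2: sites 0 and 2 interact only through site 1.
    chain = {(0, 1): 1.0, (1, 2): 1.0}
    for pair, value in pair_correlations(chain, 3).items():
        print(pair, round(value, 4))
```

The pair (0, 2) comes out clearly correlated (tanh(1)^2 for these couplings) although its direct coupling is zero, which is why a global model of couplings is needed to tell direct from indirect interactions.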
Quantitative test of the barrier nucleosome model for statistical positioning of nucleosomes up- and downstream of transcription start sites
The positions of nucleosomes in eukaryotic genomes determine which parts of
the DNA sequence are readily accessible for regulatory proteins and which are
not. Genome-wide maps of nucleosome positions have revealed a salient pattern
around transcription start sites, involving a nucleosome-free region (NFR)
flanked by a pronounced periodic pattern in the average nucleosome density.
While the periodic pattern clearly reflects well-positioned nucleosomes, the
positioning mechanism is less clear. A recent experimental study by Mavrich et
al. argued that the pattern observed in S. cerevisiae is qualitatively
consistent with a `barrier nucleosome model', in which the oscillatory pattern
is created by the statistical positioning mechanism of Kornberg and Stryer. On
the other hand, there is clear evidence for intrinsic sequence preferences of
nucleosomes, and it is unclear to what extent these sequence preferences affect
the observed pattern. To test the barrier nucleosome model, we quantitatively
analyze yeast nucleosome positioning data both up- and downstream from NFRs.
Our analysis is based on the Tonks model of statistical physics which
quantifies the interplay between the excluded-volume interaction of nucleosomes
and their positional entropy. We find that although the typical patterns on the
two sides of the NFR are different, they are both quantitatively described by
the same physical model, with the same parameters, but different boundary
conditions. The inferred boundary conditions suggest that the first nucleosome
downstream from the NFR (the +1 nucleosome) is typically directly positioned
while the first nucleosome upstream is statistically positioned via a
nucleosome-repelling DNA region. These boundary conditions, which can be
locally encoded into the genome sequence, significantly shape the statistical
distribution of nucleosomes over a range of up to ~1000 bp to each side.
Comment: includes supporting material
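Statistical positioning against a barrier can be sketched with a lattice version of the hard-rod (Tonks) gas. The code below is an assumption-laden toy, not the paper's analysis: it places rods of `width` sites on a lattice of `length` sites with a hard wall at each end, weights each rod by a fugacity `z`, and computes exact start-site probabilities from forward and backward partition functions. The parameter names and values are hypothetical.

```python
def start_probabilities(length, width, z):
    """Probability that a rod starts at each site of a walled lattice."""
    # zf[x]: partition function of the first x sites. Site x-1 is either
    # empty or the last site of a rod covering sites x-width .. x-1.
    zf = [1.0] * (length + 1)
    for x in range(1, length + 1):
        zf[x] = zf[x - 1] + (z * zf[x - width] if x >= width else 0.0)
    total = zf[length]
    # By mirror symmetry, the sites to the right of a rod starting at x
    # contribute zf[length - x - width].
    return [
        zf[x] * z * zf[length - x - width] / total
        for x in range(length - width + 1)
    ]


def occupancy(starts, width, length):
    """Probability that each site is covered by some rod."""
    covered = [0.0] * length
    for x, p in enumerate(starts):
        for y in range(x, x + width):
            covered[y] += p
    return covered


if __name__ == "__main__":
    profile = occupancy(start_probabilities(200, 10, 5.0), 10, 200)
    # Coarse text plot of the density next to the left wall.
    for site in range(0, 60, 2):
        print(site, "#" * int(40 * profile[site]))
```

Near the wall the coverage oscillates with a period set by the rod spacing and the oscillation decays with distance, the qualitative signature of statistical positioning; changing the boundary treatment changes the phase and amplitude of the pattern.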
Probabilistic methods in the analysis of protein interaction networks