122,047 research outputs found
Information content based model for the topological properties of the gene regulatory network of Escherichia coli
Gene regulatory networks (GRN) are being studied with increasingly precise
quantitative tools and can provide a testing ground for ideas regarding the
emergence and evolution of complex biological networks. We analyze the global
statistical properties of the transcriptional regulatory network of the
prokaryote Escherichia coli, identifying each operon with a node of the
network. We propose a null model for this network using the content-based
approach applied earlier to the eukaryote Saccharomyces cerevisiae (Balcan et
al., 2007). Random sequences that represent promoter regions and binding
sequences are associated with the nodes. The length distributions of these
sequences are extracted from the relevant databases. The network is constructed
by testing for the occurrence of binding sequences within the promoter regions.
The ensemble of emergent networks yields an exponentially decaying in-degree
distribution and a putative power law dependence for the out-degree
distribution with a flat tail, in agreement with the data. The clustering
coefficient, degree-degree correlation, rich club coefficient and k-core
visualization all agree qualitatively with the empirical network to an extent
not yet achieved by any other computational model, to our knowledge. The
significant statistical differences can point the way to further research into
non-adaptive and adaptive processes in the evolution of the E. coli GRN.
Comment: 58 pages, 3 tables, 22 figures. In press, Journal of Theoretical
Biology (2009)
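
The construction described in this abstract (random promoter and binding sequences, with an edge wherever a binding sequence occurs inside a promoter) can be illustrated with a minimal sketch. This is not the authors' code: the four-letter alphabet, the uniform length ranges, the choice to give every node a binding sequence, and all function names are assumptions made for illustration, whereas the paper draws the length distributions from databases.

```python
import random

ALPHABET = "ACGT"  # assumption: a four-letter nucleotide alphabet


def random_sequence(length, rng):
    """Return a uniformly random sequence of the given length."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def build_network(promoters, binding_seqs):
    """Add a directed edge i -> j when node i's binding sequence
    occurs as a substring of node j's promoter (self-loops allowed)."""
    edges = set()
    for i, binding in enumerate(binding_seqs):
        for j, promoter in enumerate(promoters):
            if binding in promoter:
                edges.add((i, j))
    return edges


def degrees(edges, n):
    """Return (in_degree, out_degree) lists for n nodes."""
    in_deg = [0] * n
    out_deg = [0] * n
    for i, j in edges:
        out_deg[i] += 1
        in_deg[j] += 1
    return in_deg, out_deg


if __name__ == "__main__":
    rng = random.Random(0)
    n = 50
    # Placeholder length ranges, NOT the empirical distributions from the paper.
    promoters = [random_sequence(rng.randint(50, 200), rng) for _ in range(n)]
    binding = [random_sequence(rng.randint(3, 8), rng) for _ in range(n)]
    in_deg, out_deg = degrees(build_network(promoters, binding), n)
    print("max in-degree:", max(in_deg), "max out-degree:", max(out_deg))
```

Repeating the construction over many random seeds would give an ensemble of networks whose degree distributions can be compared with the empirical one.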
Multiple tests of association with biological annotation metadata
We propose a general and formal statistical framework for multiple tests of
association between known fixed features of a genome and unknown parameters of
the distribution of variable features of this genome in a population of
interest. The known gene-annotation profiles, corresponding to the fixed
features of the genome, may concern Gene Ontology (GO) annotation, pathway
membership, regulation by particular transcription factors, nucleotide
sequences, or protein sequences. The unknown gene-parameter profiles,
corresponding to the variable features of the genome, may be, for example,
regression coefficients relating possibly censored biological and clinical
outcomes to genome-wide transcript levels, DNA copy numbers, and other
covariates. A generic question of great interest in current genomic research
regards the detection of associations between biological annotation metadata
and genome-wide expression measures. This biological question may be translated
as the test of multiple hypotheses concerning association measures between
gene-annotation profiles and gene-parameter profiles. A general and rigorous
formulation of the statistical inference question allows us to apply the
multiple hypothesis testing methodology developed in [Multiple Testing
Procedures with Applications to Genomics (2008) Springer, New York] and related
articles, to control a broad class of Type I error rates, defined as
generalized tail probabilities and expected values for arbitrary functions of
the numbers of Type I errors and rejected hypotheses. The resampling-based
single-step and stepwise multiple testing procedures of [Multiple Testing
Procedures with Applications to Genomics (2008) Springer, New York] take into
account the joint distribution of the test statistics and provide Type I error
control in testing problems involving general data generating distributions
(with arbitrary dependence structures among variables), null hypotheses, and
test statistics.
Comment: Published at http://dx.doi.org/10.1214/193940307000000446 in the IMS
Collections (http://www.imstat.org/publications/imscollections.htm) by the
Institute of Mathematical Statistics (http://www.imstat.org)
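
To make the idea of a single-step procedure that uses the joint distribution of the test statistics more concrete, here is a minimal sketch of single-step maxT adjusted p-values. It is a simplified stand-in, not the procedure of the cited book: the function name is hypothetical, and the matrix `null_stats` (one row per resample, one column per hypothesis) is assumed to be supplied by whatever resampling scheme is in use.

```python
def single_step_maxt(observed, null_stats):
    """Single-step maxT adjusted p-values.

    observed   : list of observed test statistics, one per hypothesis.
    null_stats : list of rows; each row holds the statistics for all
                 hypotheses computed on one resample under the null.
    The adjusted p-value of hypothesis j is the fraction of resamples
    whose largest absolute statistic is at least |observed[j]|.
    """
    if not null_stats:
        raise ValueError("need at least one resample")
    max_null = [max(abs(z) for z in row) for row in null_stats]
    b = len(max_null)
    return [sum(1 for m in max_null if m >= abs(t)) / b for t in observed]


if __name__ == "__main__":
    observed = [3.0, 0.5]
    null_stats = [[1.0, 0.2], [0.4, 2.0], [3.5, 0.1], [0.3, 0.6]]
    print(single_step_maxt(observed, null_stats))
```

Because each row of `null_stats` is maximized across hypotheses, dependence among the test statistics is carried into the adjusted p-values rather than assumed away.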
Identifying statistical dependence in genomic sequences via mutual information estimates
Questions of understanding and quantifying the representation and amount of
information in organisms have become a central part of biological research, as
they potentially hold the key to fundamental advances. In this paper, we
demonstrate the use of information-theoretic tools for the task of identifying
segments of biomolecules (DNA or RNA) that are statistically correlated. We
develop a precise and reliable methodology, based on the notion of mutual
information, for finding and extracting statistical as well as structural
dependencies. A simple threshold function is defined, and its use in
quantifying the level of significance of dependencies between biological
segments is explored. These tools are used in two specific applications. First,
we identify correlations between different parts of the maize
zmSRp32 gene, where we find significant dependencies between the 5'
untranslated region in zmSRp32 and its alternatively spliced exons. This
observation may indicate the presence of as-yet unknown alternative splicing
mechanisms or structural scaffolds. Second, using data from the FBI's Combined
DNA Index System (CODIS), we demonstrate that our approach is particularly well
suited for the problem of discovering short tandem repeats, an application of
importance in genetic profiling.
Comment: Preliminary version. Final version in EURASIP Journal on
Bioinformatics and Systems Biology. See http://www.hindawi.com/journals/bsb
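
As a rough illustration of the quantity this abstract is built on, the following sketch computes a plug-in (empirical frequency) estimate of the mutual information, in bits, between two aligned symbol sequences. It is not the authors' estimator or threshold function, neither of which is specified in the abstract; the function name and the choice of the plug-in estimator are assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired observations.

    xs and ys are equal-length sequences; position k of xs is paired
    with position k of ys.
    """
    if len(xs) != len(ys):
        raise ValueError("sequences must have equal length")
    if len(xs) == 0:
        raise ValueError("sequences must be non-empty")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    print(mutual_information("AACC", "GGTT"))  # fully dependent pairing
    print(mutual_information("AACC", "GTGT"))  # independent pairing
```

Plug-in estimates are biased upward for short sequences, which is one reason a significance threshold of the kind the abstract mentions is needed before calling two segments dependent.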
Sequence Dependence of Transcription Factor-Mediated DNA Looping
DNA is subject to large deformations in a wide range of biological processes.
Two key examples illustrate how such deformations influence the readout of the
genetic information: the sequestering of eukaryotic genes by nucleosomes, and
DNA looping in transcriptional regulation in both prokaryotes and eukaryotes.
These kinds of regulatory problems are now becoming amenable to systematic
quantitative dissection with a powerful dialogue between theory and experiment.
Here we use a single-molecule experiment in conjunction with a statistical
mechanical model to test quantitative predictions for the behavior of DNA
looping at short length scales, and to determine how DNA sequence affects
looping at these lengths. We calculate and measure how such looping depends
upon four key biological parameters: the strength of the transcription factor
binding sites, the concentration of the transcription factor, and the length
and sequence of the DNA loop. Our studies lead to the surprising insight that
sequences that are thought to be especially favorable for nucleosome formation
because of high flexibility lead to no systematically detectable effect of
sequence on looping, and begin to provide a picture of the distinctions between
the short length scale mechanics of nucleosome formation and looping.
Comment: Nucleic Acids Research (2012). Published version available at
http://nar.oxfordjournals.org/cgi/content/abstract/gks473?ijkey=6m5pPVJgsmNmbof&keytype=re
Selection of sequence motifs and generative Hopfield-Potts models for protein families
Statistical models for families of evolutionarily related proteins have
recently gained interest: in particular pairwise Potts models, as those
inferred by the Direct-Coupling Analysis, have been able to extract information
about the three-dimensional structure of folded proteins, and about the effect
of amino-acid substitutions in proteins. These models are typically required
to reproduce the one- and two-point statistics of the amino-acid usage in a
protein family, i.e., to capture the so-called residue conservation and
covariation statistics of proteins of common evolutionary origin. Pairwise
Potts models are the maximum-entropy models achieving this. Although
successful, these models depend on huge numbers of parameters introduced ad hoc,
which have to be estimated from a finite amount of data and whose
biophysical interpretation remains unclear. Here we propose an approach to
parameter reduction, which is based on selecting collective sequence motifs. It
naturally leads to the formulation of statistical sequence models in terms of
Hopfield-Potts models. These models can be accurately inferred using a mapping
to restricted Boltzmann machines and persistent contrastive divergence. We show
that, when applied to protein data, even 20-40 patterns are sufficient to
obtain statistically close-to-generative models. The Hopfield patterns form
interpretable sequence motifs and may be used to cluster amino-acid
sequences into functional sub-families. However, the distributed collective
nature of these motifs intrinsically limits the ability of Hopfield-Potts
models in predicting contact maps, showing the necessity of developing models
going beyond the Hopfield-Potts models discussed here.
Comment: 26 pages, 16 figures, to app. in PR
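
The parameter reduction mentioned in this abstract can be pictured with a small sketch. It assumes the common low-rank form in which the pairwise coupling between residue a at site i and residue b at site j is a sum over patterns of the product of pattern entries; that form, the data layout `patterns[mu][i][a]`, and the function names are assumptions here, not details taken from the abstract. Under that assumption the pairwise part of the log-probability can be computed either pair by pair or from one projection per pattern.

```python
def pair_term_direct(patterns, seq):
    """Sum over site pairs i < j of J_ij(seq[i], seq[j]), where
    J_ij(a, b) = sum over patterns of p[i][a] * p[j][b] (assumed form)."""
    length = len(seq)
    total = 0.0
    for i in range(length):
        for j in range(i + 1, length):
            total += sum(p[i][seq[i]] * p[j][seq[j]] for p in patterns)
    return total


def pair_term_via_projections(patterns, seq):
    """Same quantity from one projection per pattern:
    0.5 * ((sum_i v_i)^2 - sum_i v_i^2) with v_i = p[i][seq[i]]."""
    total = 0.0
    for p in patterns:
        vals = [p[i][a] for i, a in enumerate(seq)]
        s = sum(vals)
        total += 0.5 * (s * s - sum(v * v for v in vals))
    return total


if __name__ == "__main__":
    # One pattern over 3 sites and a 2-letter alphabet (toy values).
    patterns = [[[1.0, -1.0], [0.5, 2.0], [0.0, 1.0]]]
    seq = [0, 1, 1]
    print(pair_term_direct(patterns, seq), pair_term_via_projections(patterns, seq))
```

With a handful of patterns the model needs on the order of (patterns x sites x alphabet size) parameters instead of one coupling matrix per pair of sites, which is the kind of reduction the abstract describes.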
Comparative Analysis of the Saccharomyces cerevisiae and Caenorhabditis elegans Protein Interaction Network
Protein interaction networks aim to summarize the complex interplay of
proteins in an organism. Early studies suggested that the position of a protein
in the network determines its evolutionary rate but there has been considerable
disagreement as to what extent other factors, such as protein abundance, modify
this reported dependence.
We compare the genomes of Saccharomyces cerevisiae and Caenorhabditis elegans
with those of closely related species to elucidate the recent evolutionary
history of their respective protein interaction networks. Interaction and
expression data are studied in the light of a detailed phylogenetic analysis.
The underlying network structure is incorporated explicitly into the
statistical analysis.
The increased phylogenetic resolution, paired with high-quality interaction
data, allows us to resolve the way in which protein interaction network
structure and abundance of proteins affect the evolutionary rate. We find that
expression levels are better predictors of the evolutionary rate than a
protein's connectivity. Detailed analysis of the two organisms also shows that
the evolutionary rates of interacting proteins are not sufficiently similar to
be mutually predictive.
It appears that meaningful inferences about the evolution of protein
interaction networks require comparative analysis of reasonably closely related
species. The signature of protein evolution is shaped by a protein's abundance
in the organism, its function, and the biological process it is involved in.
Its position in the interaction networks and its connectivity may modulate this
but they appear to have only minor influence on a protein's evolutionary rate.
Comment: Accepted for publication in BMC Evolutionary Biology
Regular and stochastic behavior of Parkinsonian pathological tremor signals
Regular and stochastic behavior in the time series of Parkinsonian
pathological tremor velocity is studied on the basis of the statistical theory
of discrete non-Markov stochastic processes and flicker-noise spectroscopy. We
have developed a new method of analyzing and diagnosing Parkinson's disease
(PD) by taking into consideration discreteness, fluctuations, long- and
short-range correlations, regular and stochastic behavior, Markov and
non-Markov effects and dynamic alternation of relaxation modes in the initial
time signals. The spectrum of the statistical non-Markovity parameter reflects
Markovity and non-Markovity in the initial time series of tremor. The
relaxation and kinetic parameters used in the method allow us to estimate the
relaxation scales of diverse scenarios of the time signals produced by the
patient in various dynamic states. The local time behavior of the initial time
correlation function and the first point of the non-Markovity parameter give
detailed information about the variation of pathological tremor in the local
regions of the time series. The obtained results can be used to find the most
effective method of reducing or suppressing pathological tremor in each
individual case of a PD patient. Generally, the method allows one to assess the
efficacy of the medical treatment for a group of PD patients.
Comment: 39 pages, 10 figures, 1 table. Physica A, in press