122,047 research outputs found

    Information content based model for the topological properties of the gene regulatory network of Escherichia coli

    Full text link
    Gene regulatory networks (GRN) are being studied with increasingly precise quantitative tools and can provide a testing ground for ideas regarding the emergence and evolution of complex biological networks. We analyze the global statistical properties of the transcriptional regulatory network of the prokaryote Escherichia coli, identifying each operon with a node of the network. We propose a null model for this network using the content-based approach applied earlier to the eukaryote Saccharomyces cerevisiae. (Balcan et al., 2007) Random sequences that represent promoter regions and binding sequences are associated with the nodes. The length distributions of these sequences are extracted from the relevant databases. The network is constructed by testing for the occurrence of binding sequences within the promoter regions. The ensemble of emergent networks yields an exponentially decaying in-degree distribution and a putative power law dependence for the out-degree distribution with a flat tail, in agreement with the data. The clustering coefficient, degree-degree correlation, rich club coefficient and k-core visualization all agree qualitatively with the empirical network to an extent not yet achieved by any other computational model, to our knowledge. The significant statistical differences can point the way to further research into non-adaptive and adaptive processes in the evolution of the E. coli GRN.Comment: 58 pages, 3 tables, 22 figures. In press, Journal of Theoretical Biology (2009)

    Multiple tests of association with biological annotation metadata

    Full text link
    We propose a general and formal statistical framework for multiple tests of association between known fixed features of a genome and unknown parameters of the distribution of variable features of this genome in a population of interest. The known gene-annotation profiles, corresponding to the fixed features of the genome, may concern Gene Ontology (GO) annotation, pathway membership, regulation by particular transcription factors, nucleotide sequences, or protein sequences. The unknown gene-parameter profiles, corresponding to the variable features of the genome, may be, for example, regression coefficients relating possibly censored biological and clinical outcomes to genome-wide transcript levels, DNA copy numbers, and other covariates. A generic question of great interest in current genomic research regards the detection of associations between biological annotation metadata and genome-wide expression measures. This biological question may be translated as the test of multiple hypotheses concerning association measures between gene-annotation profiles and gene-parameter profiles. A general and rigorous formulation of the statistical inference question allows us to apply the multiple hypothesis testing methodology developed in [Multiple Testing Procedures with Applications to Genomics (2008) Springer, New York] and related articles, to control a broad class of Type I error rates, defined as generalized tail probabilities and expected values for arbitrary functions of the numbers of Type I errors and rejected hypotheses. The resampling-based single-step and stepwise multiple testing procedures of [Multiple Testing Procedures with Applications to Genomics (2008) Springer, New York] take into account the joint distribution of the test statistics and provide Type I error control in testing problems involving general data generating distributions (with arbitrary dependence structures among variables), null hypotheses, and test statistics.Comment: Published in at http://dx.doi.org/10.1214/193940307000000446 the IMS Collections (http://www.imstat.org/publications/imscollections.htm) by the Institute of Mathematical Statistics (http://www.imstat.org

    Identifying statistical dependence in genomic sequences via mutual information estimates

    Get PDF
    Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the 5' untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling.Comment: Preliminary version. Final version in EURASIP Journal on Bioinformatics and Systems Biology. See http://www.hindawi.com/journals/bsb

    Sequence Dependence of Transcription Factor-Mediated DNA Looping

    Get PDF
    DNA is subject to large deformations in a wide range of biological processes. Two key examples illustrate how such deformations influence the readout of the genetic information: the sequestering of eukaryotic genes by nucleosomes, and DNA looping in transcriptional regulation in both prokaryotes and eukaryotes. These kinds of regulatory problems are now becoming amenable to systematic quantitative dissection with a powerful dialogue between theory and experiment. Here we use a single-molecule experiment in conjunction with a statistical mechanical model to test quantitative predictions for the behavior of DNA looping at short length scales, and to determine how DNA sequence affects looping at these lengths. We calculate and measure how such looping depends upon four key biological parameters: the strength of the transcription factor binding sites, the concentration of the transcription factor, and the length and sequence of the DNA loop. Our studies lead to the surprising insight that sequences that are thought to be especially favorable for nucleosome formation because of high flexibility lead to no systematically detectable effect of sequence on looping, and begin to provide a picture of the distinctions between the short length scale mechanics of nucleosome formation and looping.Comment: Nucleic Acids Research (2012); Published version available at http://nar.oxfordjournals.org/cgi/content/abstract/gks473? ijkey=6m5pPVJgsmNmbof&keytype=re

    Selection of sequence motifs and generative Hopfield-Potts models for protein familiesilies

    Full text link
    Statistical models for families of evolutionary related proteins have recently gained interest: in particular pairwise Potts models, as those inferred by the Direct-Coupling Analysis, have been able to extract information about the three-dimensional structure of folded proteins, and about the effect of amino-acid substitutions in proteins. These models are typically requested to reproduce the one- and two-point statistics of the amino-acid usage in a protein family, {\em i.e.}~to capture the so-called residue conservation and covariation statistics of proteins of common evolutionary origin. Pairwise Potts models are the maximum-entropy models achieving this. While being successful, these models depend on huge numbers of {\em ad hoc} introduced parameters, which have to be estimated from finite amount of data and whose biophysical interpretation remains unclear. Here we propose an approach to parameter reduction, which is based on selecting collective sequence motifs. It naturally leads to the formulation of statistical sequence models in terms of Hopfield-Potts models. These models can be accurately inferred using a mapping to restricted Boltzmann machines and persistent contrastive divergence. We show that, when applied to protein data, even 20-40 patterns are sufficient to obtain statistically close-to-generative models. The Hopfield patterns form interpretable sequence motifs and may be used to clusterize amino-acid sequences into functional sub-families. However, the distributed collective nature of these motifs intrinsically limits the ability of Hopfield-Potts models in predicting contact maps, showing the necessity of developing models going beyond the Hopfield-Potts models discussed here.Comment: 26 pages, 16 figures, to app. in PR

    Comparative Analysis of the Saccharomyces cerevisiae and Caenorhabditis elegans Protein Interaction Network

    Get PDF
    Protein interaction networks aim to summarize the complex interplay of proteins in an organism. Early studies suggested that the position of a protein in the network determines its evolutionary rate but there has been considerable disagreement as to what extent other factors, such as protein abundance, modify this reported dependence. We compare the genomes of Saccharomyces cerevisiae and Caenorhabditis elegans with those of closely related species to elucidate the recent evolutionary history of their respective protein interaction networks. Interaction and expression data are studied in the light of a detailed phylogenetic analysis. The underlying network structure is incorporated explicitly into the statistical analysis. The increased phylogenetic resolution, paired with high-quality interaction data, allows us to resolve the way in which protein interaction network structure and abundance of proteins affect the evolutionary rate. We find that expression levels are better predictors of the evolutionary rate than a protein's connectivity. Detailed analysis of the two organisms also shows that the evolutionary rates of interacting proteins are not sufficiently similar to be mutually predictive. It appears that meaningful inferences about the evolution of protein interaction networks require comparative analysis of reasonably closely related species. The signature of protein evolution is shaped by a protein's abundance in the organism and its function and the biological process it is involved in. Its position in the interaction networks and its connectivity may modulate this but they appear to have only minor influence on a protein's evolutionary rate.Comment: Accepted for publication in BMC Evolutionary Biolog

    Regular and stochastic behavior of Parkinsonian pathological tremor signals

    Full text link
    Regular and stochastic behavior in the time series of Parkinsonian pathological tremor velocity is studied on the basis of the statistical theory of discrete non-Markov stochastic processes and flicker-noise spectroscopy. We have developed a new method of analyzing and diagnosing Parkinson's disease (PD) by taking into consideration discreteness, fluctuations, long- and short-range correlations, regular and stochastic behavior, Markov and non-Markov effects and dynamic alternation of relaxation modes in the initial time signals. The spectrum of the statistical non-Markovity parameter reflects Markovity and non-Markovity in the initial time series of tremor. The relaxation and kinetic parameters used in the method allow us to estimate the relaxation scales of diverse scenarios of the time signals produced by the patient in various dynamic states. The local time behavior of the initial time correlation function and the first point of the non-Markovity parameter give detailed information about the variation of pathological tremor in the local regions of the time series. The obtained results can be used to find the most effective method of reducing or suppressing pathological tremor in each individual case of a PD patient. Generally, the method allows one to assess the efficacy of the medical treatment for a group of PD patients.Comment: 39 pages, 10 figures, 1 table Physica A, in pres
    • 

    corecore