69 research outputs found

    A study on the correlation of nucleotide skews and the positioning of the origin of replication: different modes of replication in bacterial species

    Deviations from Chargaff's 2nd parity rule, according to which A≈T and G≈C in single stranded DNA, have been associated with replication as well as with transcription in prokaryotes. Based on observations regarding mainly the transcription-replication co-linearity in a large number of prokaryotic species, we formulate the hypothesis that the replication procedure may follow different modes between genomes throughout which the skews clearly follow different patterns. We draw the conclusion that multiple functional sites of origin of replication may exist in the genomes of most archaea and in some exceptional cases of eubacteria, while in the majority of eubacteria, replication occurs through a single fixed origin.
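The abstract does not say how the skews were computed. As a minimal sketch, assuming the commonly used windowed definition (G - C) / (G + C), a GC skew profile and its running sum could be obtained as follows. The function names and the default window size are illustrative choices, not taken from the study.

```python
from itertools import accumulate


def gc_skew(seq, window=1000):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows.

    Assumption: this windowed formula is a common convention; the study
    itself may have used a different skew definition.
    """
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows without any G or C get a skew of 0 to avoid division by zero.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative_skew(skews):
    """Running sum of the windowed skews."""
    return list(accumulate(skews))


if __name__ == "__main__":
    toy_genome = "GGGC" * 500 + "CCCG" * 500  # toy input, not real data
    profile = cumulative_skew(gc_skew(toy_genome, window=200))
    print(profile.index(max(profile)))  # window where the skew changes sign
```

In genomes with a single origin, extrema of the cumulative profile are often used as a rough indicator of where the skew changes sign. Whether that maps onto an origin of replication is exactly the kind of question the abstract discusses.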

    Optimal Computation of Overabundant Words

    The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n-4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.
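The abstract names the ingredients of the statistic (prefix, suffix and infix frequencies) but not the formulas. The sketch below assumes the form usually quoted for this model: expectation E(w) = f(prefix) · f(suffix) / f(infix) and deviation (f(w) - E(w)) / max(sqrt(E(w)), 1), with w reported as overabundant when the deviation is at least a threshold `rho`. It is a naive brute-force illustration over short factors, not the O(n) suffix-tree algorithm of the paper. The function names and the `max_len` cap are hypothetical.

```python
from collections import Counter
from math import sqrt


def factor_counts(x, max_len):
    """Count (overlapping) occurrences of every factor of x up to max_len."""
    counts = Counter()
    for length in range(1, max_len + 1):
        for i in range(len(x) - length + 1):
            counts[x[i:i + length]] += 1
    return counts


def deviation(w, counts):
    """Deviation of a word w (len >= 3) under the assumed model."""
    prefix, suffix, infix = w[:-1], w[1:], w[1:-1]
    # If w occurs in x, its infix occurs too, so the denominator is non-zero.
    expected = counts[prefix] * counts[suffix] / counts[infix]
    return (counts[w] - expected) / max(sqrt(expected), 1.0)


def overabundant_words(x, rho, max_len):
    """Brute force: all factors of length 3..max_len with deviation >= rho."""
    counts = factor_counts(x, max_len)
    return sorted(
        w for w in counts if len(w) >= 3 and deviation(w, counts) >= rho
    )


if __name__ == "__main__":
    print(overabundant_words("ABCXBD", rho=0.5, max_len=3))
```

The brute-force count takes time proportional to n times `max_len` and is only meant to make the definition concrete.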

    Classification of selectively constrained DNA elements using feature vectors and rule-based classifiers

    Little work has been done on the analysis of the composition of conserved non-coding elements (CNEs) that are identified by comparisons of two or more genomes and are found to exist in all metazoan genomes. Here we present the analysis of CNEs with a methodology that takes into account word occurrence at various length scales in the form of feature vector representation and rule-based classifiers. We implement our approach on both protein-coding exons and CNEs, originating from human, insect (Drosophila melanogaster) and worm (Caenorhabditis elegans) genomes, that are either identified in the present study or obtained from the literature. Alignment-free feature vector representation of sequences combined with rule-based classification methods leads to successful classification of the different CNE classes. Biologically meaningful results are derived by comparison with the genomic signatures approach, and classification rates for a variety of functional elements of the genomes along with surrogates are presented. © 2014 Elsevier Inc. All rights reserved.
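One way to picture an alignment-free feature vector built from word occurrences at several length scales is a concatenation of normalised k-mer frequencies. The sketch below is an assumption about what such a representation could look like. The choice of k values, the normalisation and the function name are illustrative, and the rule-based classifier that would consume the vectors is not shown.

```python
from collections import Counter
from itertools import product


def kmer_feature_vector(seq, ks=(1, 2, 3), alphabet="ACGT"):
    """Concatenate relative k-mer frequencies for each k in ks.

    Words are listed in lexicographic order for every k, so vectors from
    different sequences are directly comparable.
    """
    seq = seq.upper()
    features = []
    for k in ks:
        n_windows = max(len(seq) - k + 1, 1)
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        for word in product(alphabet, repeat=k):
            features.append(counts["".join(word)] / n_windows)
    return features


if __name__ == "__main__":
    vec = kmer_feature_vector("ACGTACGTTTGA")
    print(len(vec))  # 4 + 16 + 64 entries for k = 1, 2, 3
```

Vectors of this kind can be passed to any rule learner or decision-tree implementation. Which classifier and which word lengths the authors used is not stated in the abstract.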

    Information decomposition of symbolic sequences

    We developed a non-parametric method of Information Decomposition (ID) of the content of any symbolic sequence. The method is based on the calculation of Shannon mutual information between the analyzed sequence and artificial symbolic sequences, and allows latent periodicity to be revealed in any symbolic sequence. We show the stability of the ID method in the case of a large number of random letter changes in an analyzed symbolic sequence. We demonstrate the possibilities of the method by analyzing poems as well as DNA and protein sequences. In DNA and protein sequences we show the existence of many DNA and amino acid sequences with different types and lengths of latent periodicity. The possible origin of latent periodicity for different symbolic sequences is discussed. Comment: 18 pages, 8 figures
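The core idea, mutual information between the analyzed sequence and an artificial periodic one, can be sketched as follows. The code assumes the artificial sequence for a trial period p is simply the position index modulo p, and it reports plain Shannon mutual information in bits. The statistic and significance testing used in the paper may differ, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information_for_period(seq, period):
    """Mutual information (bits) between symbols and position mod period."""
    n = len(seq)
    joint = Counter((i % period, s) for i, s in enumerate(seq))
    phase = Counter(i % period for i in range(n))
    symbol = Counter(seq)
    mi = 0.0
    for (k, s), c in joint.items():
        # p(k, s) * log2( p(k, s) / (p(k) * p(s)) ), written with raw counts
        mi += (c / n) * log2(c * n / (phase[k] * symbol[s]))
    return mi


def period_spectrum(seq, max_period):
    """Mutual information for every trial period from 2 to max_period."""
    return {
        p: mutual_information_for_period(seq, p)
        for p in range(2, max_period + 1)
    }


if __name__ == "__main__":
    for p, value in period_spectrum("ATGATGATCATGTTGATG", 6).items():
        print(p, round(value, 3))
```

Multiples of a true period also score highly in such a spectrum, and short sequences inflate the estimate, so raw values would need a significance correction before any period is called.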

    Information content based model for the topological properties of the gene regulatory network of Escherichia coli

    Gene regulatory networks (GRN) are being studied with increasingly precise quantitative tools and can provide a testing ground for ideas regarding the emergence and evolution of complex biological networks. We analyze the global statistical properties of the transcriptional regulatory network of the prokaryote Escherichia coli, identifying each operon with a node of the network. We propose a null model for this network using the content-based approach applied earlier to the eukaryote Saccharomyces cerevisiae (Balcan et al., 2007). Random sequences that represent promoter regions and binding sequences are associated with the nodes. The length distributions of these sequences are extracted from the relevant databases. The network is constructed by testing for the occurrence of binding sequences within the promoter regions. The ensemble of emergent networks yields an exponentially decaying in-degree distribution and a putative power law dependence for the out-degree distribution with a flat tail, in agreement with the data. The clustering coefficient, degree-degree correlation, rich club coefficient and k-core visualization all agree qualitatively with the empirical network to an extent not yet achieved by any other computational model, to our knowledge. The significant statistical differences can point the way to further research into non-adaptive and adaptive processes in the evolution of the E. coli GRN. Comment: 58 pages, 3 tables, 22 figures. In press, Journal of Theoretical Biology (2009)
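The construction described above (random promoter and binding sequences attached to nodes, with an edge wherever a binding sequence occurs inside a promoter) can be illustrated with a toy sketch. The fixed sequence lengths, the uniform four-letter alphabet and all function names below are assumptions made for illustration. The paper draws the lengths from empirical distributions, and this sketch does not model the fact that only some nodes act as regulators.

```python
import random


def random_sequence(length, rng, alphabet="ACGT"):
    """Uniform random sequence over the given alphabet."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def build_network(binding_seqs, promoters):
    """Directed edge i -> j when binding sequence i occurs in promoter j."""
    return {
        (i, j)
        for i, b in enumerate(binding_seqs)
        for j, p in enumerate(promoters)
        if b in p
    }


def degrees(edges, n):
    """Return (out_degree, in_degree) lists for n nodes."""
    out_deg, in_deg = [0] * n, [0] * n
    for i, j in edges:
        out_deg[i] += 1
        in_deg[j] += 1
    return out_deg, in_deg


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    n = 200
    # Hypothetical lengths; the paper takes length distributions from databases.
    binding = [random_sequence(rng.randint(4, 8), rng) for _ in range(n)]
    promoters = [random_sequence(100, rng) for _ in range(n)]
    out_deg, in_deg = degrees(build_network(binding, promoters), n)
    print(max(out_deg), max(in_deg))
```

Because shorter binding sequences match far more promoters than longer ones, the spread of binding-sequence lengths alone already produces a broad out-degree distribution in such a toy ensemble.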

    Minimal Absent Words in Rooted and Unrooted Trees

    We extend the theory of minimal absent words to (rooted and unrooted) trees, having edges labeled by letters from an alphabet of cardinality. We show that the set of minimal absent words of a rooted (resp. unrooted) tree T with n nodes has cardinality (resp.), and we show that these bounds are realized. Then, we exhibit algorithms to compute all minimal absent words in a rooted (resp. unrooted) tree in output-sensitive time (resp. assuming an integer alphabet of size polynomial in n

    The Information Coded in the Yeast Response Elements Accounts for Most of the Topological Properties of Its Transcriptional Regulation Network

    The regulation of gene expression in a cell relies to a major extent on transcription factors, proteins which recognize and bind the DNA at specific binding sites (response elements) within promoter regions associated with each gene. We present an information theoretic approach to modeling transcriptional regulatory networks, in terms of a simple “sequence-matching” rule and the statistics of the occurrence of binding sequences of given specificity in random promoter regions. The crucial biological input is the distribution of the amount of information coded in these cognate response elements and the length distribution of the promoter regions. We provide an analysis of the transcriptional regulatory network of yeast Saccharomyces cerevisiae, which we extract from the available databases, with respect to the degree distributions, clustering coefficient, degree correlations, rich-club coefficient and the k-core structure. We find that these topological features are in remarkable agreement with those predicted by our model, on the basis of the amount of information coded in the interaction between the transcription factors and response elements.