    Structural fingerprints of transcription factor binding site regions

    Fourier transforms are a powerful tool in the prediction of DNA sequence properties, such as the presence/absence of codons. We have previously compiled a database of the structural properties of all 32,896 unique DNA octamers. In this work we apply Fourier techniques to the analysis of the structural properties of human chromosomes 21 and 22 and also to three sets of transcription factor binding sites within these chromosomes. We find that, for a given structural property, the structural property power spectra of chromosomes 21 and 22 are strikingly similar. We find common peaks in their power spectra for both Sp1 and p53 transcription factor binding sites. We use the power spectra as a structural fingerprint and perform similarity searching in order to find transcription factor binding site regions. This approach provides a new strategy for searching genome data for information. Although it is difficult to understand the relationship between specific functional properties and the set of structural parameters in our database, our structural fingerprints nevertheless provide a useful tool for searching for functional information in sequence data. The power spectrum fingerprints provide a simple, fast method for comparing a set of functional sequences, in this case transcription factor binding site regions, with the sequences of whole chromosomes. On its own, the power spectrum fingerprint does not find all transcription factor binding sites in a chromosome, but the results presented here show that in combination with other approaches, this technique will improve the chances of identifying functional sequences hidden in genomic data.

    Time-dependent ARMA modeling of genomic sequences

    Background: Over the past decade, many investigators have used sophisticated time series tools for the analysis of genomic sequences. Specifically, the correlation of the nucleotide chain has been studied by examining the properties of the power spectrum. The main limitation of the power spectrum is that it is restricted to stationary time series. However, it has been observed over the past decade that genomic sequences exhibit non-stationary statistical behavior. Standard statistical tests have been used to verify that the genomic sequences are indeed not stationary. More recent analysis of genomic data has relied on time-varying power spectral methods to capture the statistical characteristics of genomic sequences. Techniques such as the evolutionary spectrum and evolutionary periodogram have been successful in extracting the time-varying correlation structure. The main difficulty in using time-varying spectral methods is that they are extremely unstable. Large deviations in the correlation structure result from very minor perturbations in the genomic data and experimental procedure. A fundamentally new approach is needed in order to provide a stable platform for the non-stationary statistical analysis of genomic sequences. Results: In this paper, we propose to model non-stationary genomic sequences by a time-dependent autoregressive moving average (TD-ARMA) process. The model is based on a classical ARMA process whose coefficients are allowed to vary with time. A series expansion of the time-varying coefficients is used to form a generalized Yule-Walker-type system of equations. A recursive least-squares algorithm is subsequently used to estimate the time-dependent coefficients of the model. The non-stationary parameters estimated are used as a basis for statistical inference and biophysical interpretation of genomic data. In particular, we rely on the TD-ARMA model of genomic sequences to investigate the statistical properties and differentiate between coding and non-coding regions in the nucleotide chain. Specifically, we define a quantitative measure of randomness to assess how far a process deviates from white noise. Our simulation results on various gene sequences show that both the coding and non-coding regions are non-random. However, coding sequences are "whiter" than non-coding sequences, as attested by a higher index of randomness. Conclusion: We demonstrate that the proposed TD-ARMA model can be used to provide a stable time series tool for the analysis of non-stationary genomic sequences. The estimated time-varying coefficients are used to define an index of randomness, in order to assess the statistical correlations in coding and non-coding DNA sequences. It turns out that the statistical differences between coding and non-coding sequences are more subtle than previously thought using stationary analysis tools: both coding and non-coding sequences exhibit statistical correlations, with the coding regions being "whiter" than the non-coding regions. These results corroborate the evolutionary periodogram analysis of genomic sequences and refute the conclusion of stationary analysis that coding DNA behaves like random sequences.

    Spectral Analysis of Guanine and Cytosine Fluctuations of Mouse Genomic DNA

    We study global fluctuations of the guanine and cytosine base content (GC%) in mouse genomic DNA using spectral analyses. Power spectra $S(f)$ of GC% fluctuations in all nineteen autosomal and two sex chromosomes are observed to have the universal functional form $S(f) \sim 1/f^{\alpha}$ ($\alpha \approx 1$) over several orders of magnitude in the frequency range $10^{-7} < f < 10^{-5}$ cycle/base, corresponding to long-ranging GC% correlations at distances between 100 kb and 10 Mb. $S(f)$ for higher frequencies ($f > 10^{-5}$ cycle/base) shows a flattened power-law function with $\alpha < 1$ across all twenty-one chromosomes. The substitution of about 38% interspersed repeats does not affect the functional form of $S(f)$, indicating that these are not predominantly responsible for the long-ranged multi-scale GC% fluctuations in mammalian genomes. Several biological implications of the large-scale GC% fluctuation are discussed, including neutral evolutionary history by DNA duplication, chromosomal bands, spatial distribution of transcription units (genes), replication timing, and recombination hot spots.
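As a rough illustration of the kind of analysis the abstract above describes, the following sketch computes a windowed GC% series, a periodogram, and a log-log slope as an estimate of the exponent alpha. It is a minimal sketch under stated assumptions, not the authors' procedure: the abstract does not give the window size, the spectral estimator, or the fitting method, so the non-overlapping windows, the direct DFT, and the ordinary least-squares fit are all illustrative choices, and the function names are hypothetical.

```python
import cmath
import math


def gc_content_series(sequence, window):
    """GC fraction in consecutive non-overlapping windows of `window` bases.

    The window size is an assumption; the abstract does not state one.
    """
    series = []
    for start in range(0, len(sequence) - window + 1, window):
        chunk = sequence[start:start + window].upper()
        series.append((chunk.count("G") + chunk.count("C")) / window)
    return series


def power_spectrum(series):
    """Periodogram of the mean-subtracted series via a direct DFT (O(n^2)).

    Returns (frequency in cycles per sample, power) pairs for k = 1 .. n // 2.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        spectrum.append((k / n, abs(coeff) ** 2 / n))
    return spectrum


def fit_alpha(spectrum):
    """Least-squares slope of log S against log f; returns alpha = -slope.

    Assumes at least two points with positive power at distinct frequencies.
    """
    pts = [(math.log(f), math.log(s)) for f, s in spectrum if s > 0]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return -sxy / sxx
```

Frequencies are returned in cycles per window; dividing by the window size converts them to the cycle/base units used in the abstract. The direct DFT is only practical for short series, and an FFT would be the usual substitute at chromosome scale.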

    Quantumlike Chaos in the Frequency Distributions of the Bases A, C, G, T in Drosophila DNA

    Continuous periodogram power spectral analyses of fractal fluctuations of frequency distributions of bases A, C, G, T in Drosophila DNA show that the power spectra follow the universal inverse power-law form of the statistical normal distribution. Inverse power-law form for power spectra of space-time fluctuations is generic to dynamical systems in nature and is identified as self-organized criticality. The author has developed a general systems theory, which provides universal quantification for observed self-organized criticality in terms of the statistical normal distribution. The long-range correlations intrinsic to self-organized criticality in macro-scale dynamical systems are a signature of quantumlike chaos. The fractal fluctuations self-organize to form an overall logarithmic spiral trajectory with the quasiperiodic Penrose tiling pattern for the internal structure. Power spectral analysis resolves such a spiral trajectory as an eddy continuum with embedded dominant wavebands. The dominant peak periodicities are functions of the golden mean. The observed fractal frequency distributions of the Drosophila DNA base sequences exhibit quasicrystalline structure with long-range spatial correlations or self-organized criticality. Modification of the DNA base sequence structure at any location may have significant noticeable effects on the function of the DNA molecule as a whole. The presence of non-coding introns may not be redundant, but serve to organize the effective functioning of the coding exons in the DNA molecule as a complete unit.

    Google matrix analysis of DNA sequences

    For DNA sequences of various species we construct the Google matrix G of Markov transitions between nearby words composed of several letters. The statistical distribution of the elements of this matrix is shown to be described by a power law with an exponent close to that of outgoing links in such scale-free networks as the World Wide Web (WWW). At the same time, the sum of ingoing matrix elements is characterized by an exponent significantly larger than those typical for WWW networks. This results in a slow algebraic decay of the PageRank probability determined by the distribution of ingoing elements. The spectrum of G is characterized by a large gap leading to a rapid relaxation process on the DNA sequence networks. We introduce the PageRank proximity correlator between different species, which determines their statistical similarity from the viewpoint of Markov chains. The properties of other eigenstates of the Google matrix are also discussed. Our results establish scale-free features of DNA sequence networks, showing their similarities to and distinctions from the WWW and linguistic networks.
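The construction described in the abstract above can be sketched roughly as follows. This is a minimal sketch under assumptions, not the paper's implementation: the abstract does not say how "nearby words" are chosen, so consecutive non-overlapping words are assumed here, and the damping factor of 0.85 is the conventional PageRank default rather than a value taken from the source. The function names are hypothetical.

```python
from itertools import product


def google_matrix(sequence, word_len=2, damping=0.85):
    """Column-stochastic Google matrix over all DNA words of length `word_len`.

    Assumptions: transitions are counted between consecutive non-overlapping
    words, and damping=0.85 is the conventional default, not a source value.
    """
    words = ["".join(p) for p in product("ACGT", repeat=word_len)]
    index = {w: i for i, w in enumerate(words)}
    n = len(words)
    sequence = sequence.upper()
    tokens = [sequence[s:s + word_len]
              for s in range(0, len(sequence) - word_len + 1, word_len)]
    # counts[j][i] is the number of observed transitions from word i to word j
    counts = [[0.0] * n for _ in range(n)]
    for a, b in zip(tokens, tokens[1:]):
        if a in index and b in index:  # skip words with ambiguous bases such as N
            counts[index[b]][index[a]] += 1.0
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        col_sum = sum(counts[j][i] for j in range(n))
        for j in range(n):
            # words never seen as a source get a uniform column (dangling nodes)
            s = counts[j][i] / col_sum if col_sum else 1.0 / n
            matrix[j][i] = damping * s + (1.0 - damping) / n
    return words, matrix


def pagerank(matrix, iterations=100):
    """PageRank vector by power iteration, starting from the uniform vector."""
    n = len(matrix)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [sum(matrix[j][i] * p[i] for i in range(n)) for j in range(n)]
    return p
```

For words of length m the matrix has 4^m rows and columns, so this dense pure-Python version is only suitable for small m; a sparse representation would be the natural choice for longer words.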

    Cluster-scaling, chaotic order and coherence in DNA

    Different numerical mappings of DNA sequences have been studied using a new cluster-scaling method and the well-known spectral methods. It is shown, in particular, that the nucleotide sequences in DNA molecules have robust cluster-scaling properties. These properties are relevant to both types of base-pair interactions: hydrogen bonds and stacking interactions. It is shown that taking the cluster-scaling properties into account can help to improve heterogeneous models of DNA dynamics. It is also shown that a chaotic (deterministic) order, rather than stochastic randomness, controls the positions of the energy minima of the stacking interactions in DNA sequences on large scales. The chaotic order results in a large-scale chaotic coherence between the two complementary sequences of the DNA duplex. A competition between this broad-band chaotic coherence and the resonance coherence produced by the genetic code is briefly discussed. The Arabidopsis plant genome (a model plant for genome analysis) and two human genes, BRCA2 and NRXN1, are considered as examples. Comment: extended. arXiv admin note: substantial text overlap with arXiv:1008.135

    Long range correlations in DNA : scaling properties and charge transfer efficiency

    We address the relation between long range correlations and charge transfer efficiency in aperiodic artificial or genomic DNA sequences. Coherent charge transfer through the HOMO states of the guanine nucleotide is studied using the transmission approach, and we focus on how the sequence-dependent backscattering profile can be inferred from correlations between base pairs. Comment: Submitted to Phys. Rev. Let