
    Distinguish Coding And Noncoding Sequences In A Complete Genome Using Fourier Transform

    A Fourier transform method is proposed to distinguish coding and non-coding sequences in a complete genome, based on a number sequence representation of the DNA sequence proposed in our previous paper (Zhou et al., J. Theor. Biol. 2005) and the imperfect periodicity of 3 in protein coding sequences. Three parameters, P_x(\bar{S})(1), P_x(\bar{S})(1/3) and P_x(\bar{S})(1/36), in the Fourier transform of the number sequence representation are selected to form a three-dimensional parameter space. Each DNA sequence is then represented by a point in this space. The points corresponding to coding and non-coding sequences in the complete genomes of prokaryotes are seen to fall into different regions. If the point (P_x(\bar{S})(1), P_x(\bar{S})(1/3), P_x(\bar{S})(1/36)) for a DNA sequence lies in the region corresponding to coding sequences, the sequence is classified as coding; otherwise, it is classified as non-coding. Fisher's discriminant algorithm is used to study the discriminant accuracy. The average discriminant accuracies pc, pnc, qc and qnc over all 51 prokaryotes obtained by the present method reach 81.02%, 92.27%, 80.77% and 92.24%, respectively.
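
    The period-3 signal mentioned in this abstract can be illustrated with a minimal sketch. It does not use the number sequence representation of Zhou et al., which is not given here; it swaps in the common 0/1 indicator mapping (one binary sequence per base) as an assumption. The function names `power_at` and `period3_score` are hypothetical.

```python
import cmath

def power_at(seq, freq):
    """Sum over the four bases of |DFT of the 0/1 indicator sequence|^2,
    evaluated at `freq` (cycles per base) and divided by the sequence length.

    Assumption: indicator mapping, not the representation used in the paper.
    """
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        # Only positions holding `base` contribute a non-zero term.
        coeff = sum(cmath.exp(-2j * cmath.pi * freq * k)
                    for k, ch in enumerate(seq) if ch == base)
        total += abs(coeff) ** 2
    return total / n

def period3_score(seq):
    """Spectral power at frequency 1/3, the codon periodicity."""
    return power_at(seq, 1.0 / 3.0)
```

    On a toy input such as `"ATG" * 30` the score is large because every base sits at a fixed codon position, while a homopolymer such as `"A" * 90` scores essentially zero. Real coding sequences show a weaker, imperfect version of the same peak.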

    Genomics and proteomics: a signal processor's tour

    The theory and methods of signal processing are becoming increasingly important in molecular biology. Digital filtering techniques, transform domain methods, and Markov models have played important roles in gene identification, biological sequence analysis, and alignment. This paper contains a brief review of molecular biology, followed by a review of the applications of signal processing theory. This includes the problem of gene finding using digital filtering, and the use of transform domain methods in the study of protein binding spots. The relatively new topic of noncoding genes, and the associated problem of identifying ncRNA buried in DNA sequences, are also described. This includes a discussion of hidden Markov models and context free grammars. Several new directions in genomic signal processing are briefly outlined at the end.

    Correlation property of length sequences based on global structure of complete genome

    This paper considers three kinds of length sequences of the complete genome. Detrended fluctuation analysis, spectral analysis, and the mean distance spanned within time L are used to discuss the correlation property of these sequences. The values of the exponents obtained by these methods for the three kinds of length sequences of bacteria indicate that long-range correlations exist in most of these sequences. The correlations have a rich variety of behaviours, including the presence of anti-correlations. Furthermore, using the exponent γ, it is found that these correlations are all linear (γ = 1.0 ± 0.03). It is also found that these sequences exhibit 1/f noise in some interval of frequency (f > 1). The length of this interval of frequency depends on the length of the sequence. The shape of the periodogram in f > 1 exhibits some periodicity. The period seems to depend on the length and the complexity of the length sequence. Comment: RevTex, 9 pages with 5 figures and 3 tables. Phys. Rev. E Jan. 1, 2001 (to appear)
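
    Detrended fluctuation analysis, one of the methods named in this abstract, can be sketched as follows. This is a generic first-order DFA under stated assumptions, not the paper's exact procedure: the window sizes, the helper names (`linear_fit`, `dfa_fluctuation`, `dfa_exponent`, `white_noise`) and the use of non-overlapping windows are all illustrative choices.

```python
import math
import random

def linear_fit(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def dfa_fluctuation(series, window):
    """RMS residual of the integrated series after removing a
    linear trend from each non-overlapping window."""
    mean = sum(series) / len(series)
    profile, running = [], 0.0
    for value in series:
        running += value - mean          # cumulative sum of deviations
        profile.append(running)
    xs = list(range(window))
    squared, count = 0.0, 0
    for start in range(0, len(profile) - window + 1, window):
        chunk = profile[start:start + window]
        slope, intercept = linear_fit(xs, chunk)
        squared += sum((y - (slope * x + intercept)) ** 2
                       for x, y in zip(xs, chunk))
        count += window
    return math.sqrt(squared / count)

def dfa_exponent(series, windows=(8, 16, 32, 64, 128)):
    """Slope of log F(s) against log s (assumed window sizes)."""
    log_s = [math.log(s) for s in windows]
    log_f = [math.log(dfa_fluctuation(series, s)) for s in windows]
    slope, _ = linear_fit(log_s, log_f)
    return slope

def white_noise(n, seed=0):
    """Uncorrelated Gaussian test series (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]
```

    For an uncorrelated series such as `white_noise(4096)` the estimated exponent should sit near 0.5. Values above 0.5 point to long-range correlation and values below 0.5 to anti-correlation, which is the kind of distinction the abstract draws.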

    Measure representation and multifractal analysis of complete genomes

    This paper introduces the notion of measure representation of DNA sequences. Spectral analysis and multifractal analysis are then performed on the measure representations of a large number of complete genomes. The main aim of this paper is to discuss the multifractal property of the measure representation and the classification of bacteria. From the measure representations and the values of the D_q spectra and related C_q curves, it is concluded that these complete genomes are not random sequences. In fact, the spectral analyses performed indicate that these measure representations, considered as time series, exhibit strong long-range correlation. For substrings with length K=8, the D_q spectra of all organisms studied are multifractal-like and sufficiently smooth for the C_q curves to be meaningful. The C_q curves of all bacteria resemble a classical phase transition at a critical point. But the 'analogous' phase transitions of chromosomes of non-bacterial organisms are different. Apart from Chromosome 1 of C. elegans, they exhibit the shape of a double-peaked specific heat function. Comment: 12 pages with 9 figures and 1 table
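
    A measure representation of this kind can be sketched by counting every substring of length K and placing its relative frequency at a position determined by reading the substring as a base-4 number. The digit ordering A=0, C=1, G=2, T=3 and the function name `measure_representation` are assumptions made for illustration; the abstract itself does not fix them.

```python
from collections import Counter

# Assumed digit assignment for the base-4 index of a K-string.
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}

def measure_representation(seq, k):
    """Return a list of length 4**k whose entry i is the relative
    frequency of the K-string with base-4 index i.

    K-strings containing symbols outside A, C, G, T are skipped.
    """
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    measure = [0.0] * (4 ** k)
    for word, count in counts.items():
        if all(ch in DIGIT for ch in word):
            index = 0
            for ch in word:
                index = index * 4 + DIGIT[ch]
            measure[index] = count / total
    return measure
```

    For example, `measure_representation("ACGTACGT", 2)` spreads seven overlapping 2-strings over 16 cells. With K=8, as in the abstract, the list has 4**8 = 65,536 cells and can be treated as a series for spectral or multifractal analysis.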

    Human Promoter Prediction Using DNA Numerical Representation

    With the emergence of genomic signal processing, numerical representation techniques for the DNA alphabet set {A, G, C, T} play a key role in applying digital signal processing and machine learning techniques to the processing and analysis of DNA sequences. The choice of the numerical representation of a DNA sequence affects how well the biological properties can be reflected in the numerical domain for the detection and identification of the characteristics of special regions of interest within the DNA sequence. This dissertation presents a comprehensive study of various DNA numerical and graphical representation methods and their applications in processing and analyzing long DNA sequences. Discussions on the relative merits and demerits of the various methods, experimental results and possible future developments have also been included. Another area of research focus is promoter prediction in human (Homo sapiens) DNA sequences with a neural network based multi-classifier system using DNA numerical representation methods. In spite of the recent development of several computational methods for human promoter prediction, there is a need for performance improvement. In particular, the high false positive rate of the feature-based approaches decreases the prediction reliability and leads to erroneous results in gene annotation. To improve the prediction accuracy and reliability, DigiPromPred, a numerical representation based promoter prediction system, is proposed to characterize DNA alphabets in different regions of a DNA sequence. The DigiPromPred system is found to be able to predict promoters with a sensitivity of 90.8% while reducing the false prediction rate for non-promoter sequences with a specificity of 90.4%. The comparative study with state-of-the-art promoter prediction systems for human chromosome 22 shows that our proposed system maintains a good balance between prediction accuracy and reliability.
    To reduce the system architecture and computational complexity compared to the existing system, a simple feed forward neural network classifier known as SDigiPromPred is proposed. The SDigiPromPred system is found to be able to predict promoters with sensitivities of 87%, 87% and 99% while reducing the false prediction rate for non-promoter sequences with specificities of 92%, 94% and 99% for Human, Drosophila, and Arabidopsis sequences respectively, with reconfigurable capability compared to the existing system.

    Ab initio gene identification: prokaryote genome annotation with GeneScan and GLIMMER

    We compare the annotation of three complete genomes using the ab initio methods of gene identification GeneScan and GLIMMER. The annotation given in GenBank, the standard against which these are compared, has been made using GeneMark. We find a number of novel genes which are predicted by both methods used here, as well as a number of genes that are predicted by GeneMark, but are not identified by either of the nonconsensus methods that we have used. The three organisms studied here are all prokaryotic species with fairly compact genomes. The Fourier measure forms the basis for an efficient non-consensus method for gene prediction, and the algorithm GeneScan exploits this measure. We have benchmarked this program as well as GLIMMER using 3 complete prokaryotic genomes. An effort has also been made to study the limitations of these techniques for complete genome analysis. GeneScan and GLIMMER are of comparable accuracy insofar as gene identification is concerned, with sensitivities and specificities typically greater than 0.9. The number of false predictions (both positive and negative) is higher for GeneScan as compared to GLIMMER, but in a significant number of cases, similar results are provided by the two techniques. This suggests that there could be some as-yet unidentified additional genes in these three genomes, and also that some of the putative identifications made hitherto might require re-evaluation. All these cases are discussed in detail.

    Hierarchical structure of cascade of primary and secondary periodicities in Fourier power spectrum of alphoid higher order repeats

    Background: Identification of approximate tandem repeats is an important task of broad significance and remains a challenging problem in computational genomics. Often there is no single best approach to periodicity detection, and a combination of different methods may improve the prediction accuracy. The discrete Fourier transform (DFT) has been used extensively to study primary periodicities in DNA sequences. Here we investigate the application of the DFT method to identify and study alphoid higher order repeats.
    Results: We used a method based on the DFT, with mapping of the symbolic sequence into a numerical sequence, to identify and study alphoid higher order repeats (HORs). For HORs, the power spectrum shows an equidistant frequency pattern with a characteristic two-level hierarchical organization as the signature of a HOR. Our case study was the 16mer HOR tandem in AC017075.8 from human chromosome 7. A very long array of equidistant peaks at multiple frequencies (more than a thousand higher harmonics) is based on the fundamental frequency of the 16mer HOR. A pronounced subset of equidistant peaks is based on multiples of the fundamental HOR frequency (multiplication factor n for an nmer) and higher harmonics. In general, an nmer HOR pattern contains equidistant secondary periodicity peaks, with a pronounced subset of equidistant primary periodicity peaks. This hierarchical pattern as a signature for HOR detection is robust with respect to monomer insertions and deletions, random sequence insertions, etc. For a monomeric alphoid sequence, only primary periodicity peaks are present. The 1/f^β noise and the periodicity-three pattern are missing from power spectra in alphoid regions, in accordance with expectations.
    Conclusion: The DFT provides a robust detection method for higher order periodicity. The easily recognizable HOR power spectrum is characterized by a hierarchical two-level equidistant pattern: higher harmonics of the fundamental HOR frequency (secondary periodicity) and a subset of pronounced peaks corresponding to constituent monomers (primary periodicity). The number of lower frequency peaks (secondary periodicity) below the frequency of the first primary periodicity peak reveals the size of the nmer HOR, i.e., the number n of monomers contained in the consensus HOR.
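
    The two-level peak pattern described in this abstract can be reproduced on a toy example. The sketch below is an illustration under stated assumptions, not the authors' pipeline: it uses a 0/1 indicator mapping per base, a made-up 5-base "monomer" in three slightly diverged copies forming a 15-base "HOR" unit, and hypothetical function names.

```python
import cmath

def power_spectrum(seq):
    """Power at DFT bins 0..n//2, summed over the four 0/1
    indicator sequences (assumed symbolic-to-numeric mapping)."""
    n = len(seq)
    spectrum = [0.0] * (n // 2 + 1)
    for base in "ACGT":
        positions = [k for k, ch in enumerate(seq) if ch == base]
        for f in range(n // 2 + 1):
            coeff = sum(cmath.exp(-2j * cmath.pi * f * k / n)
                        for k in positions)
            spectrum[f] += abs(coeff) ** 2
    return spectrum

def peak_bins(spectrum, threshold=1e-6):
    """Non-zero-frequency bins whose power exceeds a small threshold."""
    return [f for f, p in enumerate(spectrum) if f > 0 and p > threshold]

# Toy data: three diverged copies of a 5-base monomer make one 15-base unit.
TOY_HOR_UNIT = "ACGTA" + "ACGTC" + "ACCTA"
TOY_SEQUENCE = TOY_HOR_UNIT * 8          # 120 bases, exact tandem repeat
```

    In this toy sequence, peaks appear only at multiples of bin 8 (the 15-base unit), and the peak at bin 24 (the 5-base monomer) is much stronger than those at bins 8 and 16. Real alphoid arrays are far longer and only approximately periodic, so the peaks there are broadened rather than exact.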

    Structural fingerprints of transcription factor binding site regions

    Fourier transforms are a powerful tool in the prediction of DNA sequence properties, such as the presence/absence of codons. We have previously compiled a database of the structural properties of all 32,896 unique DNA octamers. In this work we apply Fourier techniques to the analysis of the structural properties of human chromosomes 21 and 22 and also to three sets of transcription factor binding sites within these chromosomes. We find that, for a given structural property, the structural property power spectra of chromosomes 21 and 22 are strikingly similar. We find common peaks in their power spectra for both Sp1 and p53 transcription factor binding sites. We use the power spectra as a structural fingerprint and perform similarity searching in order to find transcription factor binding site regions. This approach provides a new strategy for searching the genome data for information. Although it is difficult to understand the relationship between specific functional properties and the set of structural parameters in our database, our structural fingerprints nevertheless provide a useful tool for searching for functional information in sequence data. The power spectrum fingerprints provide a simple, fast method for comparing a set of functional sequences, in this case transcription factor binding site regions, with the sequences of whole chromosomes. On its own, the power spectrum fingerprint does not find all transcription factor binding sites in a chromosome, but the results presented here show that in combination with other approaches, this technique will improve the chances of identifying functional sequences hidden in genomic data.
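
    The figure of 32,896 octamers is consistent with counting each octamer and its reverse complement as one sequence: (4^8 + 4^4) / 2 = (65,536 + 256) / 2 = 32,896, where 4^4 counts the octamers equal to their own reverse complement. That this is how the authors arrived at the number is an assumption; the sketch below only checks the arithmetic by enumeration, using hypothetical helper names.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Reverse complement of a DNA word over A, C, G, T."""
    return word.translate(COMPLEMENT)[::-1]

def count_unique(k):
    """Number of k-mers when a word and its reverse complement
    are treated as the same double-stranded sequence."""
    seen = set()
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        # Keep one canonical representative per strand pair.
        seen.add(min(word, reverse_complement(word)))
    return len(seen)
```

    Calling `count_unique(8)` enumerates all 65,536 octamers and returns 32,896, matching the database size quoted in the abstract.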

    Genetic Algorithms for the Imitation of Genomic Styles in Protein Backtranslation

    Several technological applications require the translation of a protein into a nucleic acid that codes for it ("backtranslation"). The degeneracy of the genetic code makes this translation ambiguous; moreover, not every translation is equally viable. The common answer to this problem is the imitation of the codon usage of the target species. Here we discuss several other features of coding sequences ("coding statistics") that are relevant for the "genomic style" of different species. A genetic algorithm is then used to obtain backtranslations that mimic these styles, by minimizing the difference in the coding statistics. Possible improvements and applications are discussed. Comment: 17 pages, 13 figures. Submitted to Theor. Comp. Science