825 research outputs found

    A transcription frame-based analysis of the genomic DNA sequence of a hyper-thermophilic archaeon for the identification of genes, pseudo-genes and operon structures

    An algorithm for identifying transcription units, independently regulated genes and operons, and pseudo-genes that are not expected to be expressed, has been developed by combining a system for predicting transcription and translation signals with a system for scoring the triplet periodicity in ORF candidates. Using the algorithm, the 1.09 Mb sequence that covers approximately 60% of the genome of Pyrococcus sp. OT3 has been analyzed. The identified ORFs show the expected biological and physical characteristics, while the rejected ORF candidates do not. Frequent use of operon structures for transcription, and gene duplication followed by mutation or termination of the duplicated genes, are discussed.

    Visualization of the protein-coding regions with a self adaptive spectral rotation approach

    Identifying protein-coding regions in DNA sequences is an active issue in computational biology. In this study, we present a self adaptive spectral rotation (SASR) approach, which visualizes coding regions in DNA sequences based on the triplet periodicity property, without any preceding training process. It is intended to support rough prediction of coding regions when the extra information needed to train other methods is unavailable. In this approach, at each position in the DNA sequence, a Fourier spectrum is calculated from the posterior subsequence. Following the spectra, a random walk in the complex plane is generated as the SASR's graphic output. Applications of the SASR to real DNA data show that patterns in the graphic output reveal the locations of the coding regions and the frame shifts between them: arcs indicate coding regions, stable points indicate non-coding regions and corners’ shapes reveal frame shifts. Tests on a genomic data set from Saccharomyces cerevisiae reveal that the graphic patterns for coding and non-coding regions differ to a great extent, so that the coding regions can be visually distinguished. A time cost test shows that the SASR can be implemented with a computational complexity of O(N).

    Periodicity of DNA in exons

    BACKGROUND: The periodic pattern of DNA in exons is a known phenomenon. It was suggested that one of the initial causes of periodicity could be the universal (RNY)n pattern (R = A or G, Y = C or U, N = any base) of ancient RNA. Two major questions were addressed in this paper. Firstly, the cause of DNA periodicity was investigated by comparing real and simulated coding sequences. Secondly, DNA periodicity was quantified using an evolutionary algorithm, which had not previously been used for such purposes. RESULTS: We have shown that simulated coding sequences, which were composed using codon usage frequencies only, demonstrate DNA periodicity very similar to that observed in real exons. It was also found that DNA periodicity disappears in the simulated sequences when the frequencies of codons become equal. Frequencies of the nucleotides (and the dinucleotide AG) at each location along phase 0 exons were calculated for C. elegans, D. melanogaster and H. sapiens. Two models were used to fit these data, with the key objective of describing periodicity. For both models, the best-fit curves closely matched the actual data points. The first model, with dynamic period determination, consistently generated a value very close to a period of 3 nucleotides. The second model, with a fixed period, kept the period exactly equal to 3, as expected, without detracting from the goodness of fit. CONCLUSIONS: DNA periodicity in exons is determined by codon usage frequencies. It is essential to differentiate between DNA periodicity itself and the length of the period, which equals 3. Periodicity itself is a result of certain combinations of codons with different frequencies typical for a species. The period length of 3, in contrast, is caused by the triplet nature of the genetic code. The models and evolutionary algorithm used for characterising DNA periodicity are shown to be an effective tool for describing the periodicity pattern in a species when a number of exons in the same phase are analysed.
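
    The claim that periodicity follows from unequal codon usage can be illustrated with a minimal sketch. It is not the paper's evolutionary algorithm or either of its fitting models: it only computes the expected base composition at codon positions 1, 2 and 3 from a codon usage table and scores how much those compositions differ. The names position_base_freqs and periodicity_score, and the 5x weighting of RNY codons, are assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"

def position_base_freqs(codon_usage):
    """Expected base frequencies at codon positions 1, 2 and 3.

    codon_usage maps each codon (e.g. "ATG") to a non-negative weight.
    """
    total = sum(codon_usage.values())
    freqs = [{b: 0.0 for b in BASES} for _ in range(3)]
    for codon, weight in codon_usage.items():
        for pos, base in enumerate(codon):
            freqs[pos][base] += weight / total
    return freqs

def periodicity_score(codon_usage):
    """Sum of squared deviations of per-position base frequencies
    from each base's mean frequency over the three codon positions.
    Zero means all three positions have the same base composition."""
    freqs = position_base_freqs(codon_usage)
    score = 0.0
    for b in BASES:
        mean = sum(f[b] for f in freqs) / 3
        score += sum((f[b] - mean) ** 2 for f in freqs)
    return score

# All 64 codons equally frequent (stop codons included for simplicity).
uniform_usage = {"".join(c): 1.0 for c in product(BASES, repeat=3)}

# Hypothetical biased table: RNY codons (R = A/G, Y = C/T) weighted 5x.
biased_usage = {
    codon: (5.0 if codon[0] in "AG" and codon[2] in "CT" else 1.0)
    for codon in uniform_usage
}
```

    With equal codon frequencies the three positions share one base composition and the score is zero; any bias that makes the positions differ yields a positive score, which is the kind of position dependence that shows up as a period-3 signal.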

    Localizing triplet periodicity in DNA and cDNA sequences

    Background: The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) due to the fact that coding exons contain a series of three-nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied to each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in non-coding regions. This property has been used to identify the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size affects the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism C. elegans. Results: Using both simulated TP signals and the real C. elegans sequence F56F11 as an example, we demonstrate that (1) the Modified Wavelet Transform (MWT) can better define the boundary of a TP region than the conventional Short Time Fourier Transform (STFT); (2) the scale parameter (a) of the MWT determines the precision of TP boundary localization: bigger values of a give sharper TP boundaries but result in a lower signal-to-noise ratio; (3) RNA splicing sites have weaker TP signals than coding regions; (4) TP signals in coding regions can be destroyed or recovered by frame-shift mutations; (5) 6 bp periodicities in introns and intergenic regions can generate false positive signals, which can be removed with a 6 bp MWT. Conclusions: MWT can provide more precise TP boundaries than STFT, and the boundaries can be further refined by a bigger-scale MWT. Subtraction of 6 bp periodicity signals reduces the number of false positives. Experimentally introduced frame-shift mutations help recover TP signals that have been lost through possible ancient frame-shifts. More importantly, the TP signal has the potential to be used to detect the splice junctions in fully spliced mRNA sequence.
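
    The windowed Fourier baseline described in the background can be sketched as follows. This is a plain sliding-window estimate of spectral power at frequency 1/3, not the MWT method the abstract reports. The function names, the window of 99 bases and the step of 3 are assumptions chosen for illustration.

```python
import cmath

def tp_power(segment):
    """Spectral power at frequency 1/3, summed over the four 0/1
    base-indicator sequences and divided by the segment length."""
    n = len(segment)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # DFT coefficient at frequency 1/3 of this base's indicator sequence.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * k / 3)
            for k, b in enumerate(segment)
            if b == base
        )
        total += abs(coeff) ** 2
    return total / n

def sliding_tp(seq, window=99, step=3):
    """Return (start, power) pairs for each window along the sequence."""
    return [
        (start, tp_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

    A window lying entirely inside a strongly periodic stretch scores high, while windows that straddle a boundary score in between. That smearing over the window length is the localization limit the abstract attributes to fixed-window Fourier analysis.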

    Mechanisms of Geomagnetic Field Influence on Gene Expression Using Influenza as a Model System: Basics of Physical Epidemiology

    Recent studies demonstrate distinct changes in gene expression in cells exposed to a weak magnetic field (MF). The mechanisms of this phenomenon are not yet understood. We propose that proteins of the Cryptochrome family (CRY) are “epigenetic sensors” of MF fluctuations, i.e., the magnetic field-sensitive part of the epigenetic controlling mechanism. It was shown that CRY represses the activity of the major circadian transcriptional complex CLOCK/BMAL1. At the same time, the function of CRY is apparently highly responsive to weak MF because of radical pairs that periodically arise in the functionally active site of CRY and mediate the radical pair mechanism of magnetoreception. It is known that the circadian complex influences the function of every organ and tissue, including modulation of both NF-κB- and glucocorticoid-dependent signaling pathways. Thus, MFs and solar cycle-dependent geomagnetic field fluctuations are capable of altering the expression of genes related to the function of NF-κB, hormones and other biological regulators. Notably, NF-κB, along with its significant role in the immune response, also participates in differential regulation of influenza virus RNA synthesis. The presented data suggest that in the case of global application (for example, the geomagnetic field), MF-mediated regulation may have epidemiological and other consequences.

    Patterns of nucleotides that flank substitutions in human orthologous genes

    Background: Sequence context is an important aspect of base mutagenesis, and three-base periodicity is an intrinsic property of coding sequences. However, how three-base periodicity is influenced in the vicinity of substitutions is still unclear. The effect of context on mutagenesis should be revealed in the usage of nucleotides that flank substitutions. Relative entropy (also known as Kullback-Leibler divergence) is useful for finding unusual patterns in biological sequences. Results: Using relative entropy, we visualized the periodic patterns in the context of substitutions in human orthologous genes. Neighbouring patterns differed both among substitution categories and within a category that occurred at three codon positions. Transitions tended to occur in periodic sequences relative to transversions. Periodic signals were stronger in a set of flanking sequences of substitutions that occurred at the third codon position than in those that occurred at the first or second codon positions. To determine how the three-base periodicity was affected near the substitution sites, we fitted a sine model to the values of the relative entropy. A sine with a period of 3 is a good approximation for the three-base periodicity at sites not in close vicinity to some substitutions. These periods were interrupted near the substitution site and then reappeared away from substitutions. A comparative analysis between the native and codon-shuffled datasets suggested that codon usage frequency was not the sole origin of the three-base periodicity, implying that the native order of codons also played an important role in this periodicity. Synonymous codon shuffling revealed that synonymous codon usage bias was one of the factors responsible for the observed three-base periodicity. Conclusions: Our results offer an efficient way to illustrate unusual periodic patterns in the context of substitutions and provide further insight into the origin of three-base periodicity. This periodicity is a result of the native codon order in the reading frame. The period length of 3 is caused by the usage bias of nucleotides in synonymous codons. The periodic features in nucleotides surrounding substitutions aid in further understanding genetic variation and nucleotide mutagenesis.
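
    Relative entropy, as used above to expose patterns in flanking nucleotides, can be sketched per alignment column. This is a minimal illustration, not the paper's pipeline: the function names, the input layout (equal-length flanking strings aligned on the substitution site) and the uniform background in the usage note are assumptions.

```python
import math
from collections import Counter

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    p and q map bases to probabilities; q must be strictly positive
    wherever p is positive."""
    return sum(p[b] * math.log2(p[b] / q[b]) for b in p if p[b] > 0)

def column_relative_entropy(flanks, background):
    """Relative entropy of the observed base distribution in each column
    of a set of equal-length flanking sequences, versus a background.

    Assumes the sequences contain only the bases listed in background."""
    length = len(flanks[0])
    values = []
    for i in range(length):
        counts = Counter(seq[i] for seq in flanks)
        n = sum(counts.values())
        observed = {b: counts[b] / n for b in background}
        values.append(relative_entropy(observed, background))
    return values
```

    For example, with a uniform background of 0.25 per base, a column in which every sequence carries the same base scores 2 bits and a column with all four bases equally represented scores 0. A period-3 pattern would appear as a repeating rise and fall of these values along the flank.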

    On the Evolution of the Standard Genetic Code: Vestiges of Critical Scale Invariance from the RNA World in Current Prokaryote Genomes

    Herein, two genetic codes through which the primeval RNA code could have given rise to the standard genetic code (SGC) are derived. One of them, called extended RNA code type I, consists of all codons of the type RNY (purine-any base-pyrimidine) plus codons obtained by considering the RNA code in the second (NYR type) and third (YRN type) reading frames. The extended RNA code type II comprises all codons of the type RNY plus codons that arise from transversions of the RNA code in the first (YNY type) and third (RNR type) nucleotide bases. In order to test whether putative nucleotide sequences in the RNA World and in both extended RNA codes share the same scaling and statistical properties as those encountered in current prokaryotes, we used the genomes of four Eubacteria and three Archaea. For each prokaryote, we obtained the respective genomes obeying the RNA code or the extended RNA codes types I and II. In each case, we estimated the scaling properties of triplet sequences via a renormalization group approach, and we calculated the frequency distributions of distances for each codon. Remarkably, the scaling properties of the distance series of some codons from the RNA code and most codons from both extended RNA codes turned out to be identical or very close to the scaling properties of codons of the SGC. To test the robustness of these results, we show, via computer simulation experiments, that random mutations of current genomes at a rate of 10^-10 per site per year over three billion years were not enough to destroy the observed patterns. Therefore, we conclude that most current prokaryotes may still contain relics of the primeval RNA World and that both extended RNA codes may well represent two plausible evolutionary paths between the RNA code and the current SGC.
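
    As a rough check on the scale of the simulated mutation load, the quoted rate and duration can be multiplied out. The Poisson step is an assumption added here (independent mutations at a constant rate) and is not stated in the abstract.

```latex
% Expected number of mutations per site over the simulated period
\mu t = 10^{-10}\ \text{site}^{-1}\,\text{yr}^{-1} \times 3 \times 10^{9}\ \text{yr} = 0.3

% Assuming Poisson-distributed hits, fraction of sites mutated at least once
1 - e^{-\mu t} = 1 - e^{-0.3} \approx 0.26
```

    Under these assumptions roughly a quarter of sites would be hit at least once, so most positions stay unchanged, which is in line with the reported persistence of the patterns.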

    Detection of frameshifts and improving genome annotation

    We developed a new program called GeneTack for ab initio frameshift detection in intronless protein-coding nucleotide sequences. The GeneTack program uses a hidden Markov model (HMM) of a genomic sequence with possibly frameshifted protein-coding regions. The Viterbi algorithm finds the maximum likelihood path that discriminates between true adjacent genes and a single gene with a frameshift. We tested GeneTack as well as two other earlier developed programs, FrameD and FSFind, on 17 prokaryotic genomes with frameshifts introduced randomly into known genes. We observed that the average frameshift prediction accuracy of GeneTack, in terms of (Sn+Sp)/2 values, was higher by a significant margin than the accuracy of the other two programs. GeneTack was used to screen 1,106 complete prokaryotic genomes, and 206,991 genes with frameshifts (fs-genes) were identified. Our goal was to determine if a frameshift transition was due to (i) a sequencing error, (ii) an indel mutation or (iii) a recoding event. We grouped 102,731 genes with frameshifts (fs-genes) into 19,430 clusters based on sequence similarity between their protein products (fs-proteins), conservation of predicted frameshift position, and its direction. While fs-genes in 2,810 clusters were classified as conserved pseudogenes and fs-genes in 1,200 clusters were classified as hypothetical pseudogenes, 5,632 fs-genes from 239 clusters possessing conserved motifs near frameshifts were predicted to be recoding candidates. Experiments were performed for sequences derived from 20 out of the 239 clusters; programmed ribosomal frameshifting with efficiency higher than 10% was observed for four clusters. GeneTack was also applied to 1,165,799 mRNAs from 100 eukaryotic species, and 45,295 frameshifts were identified. A clustering approach similar to the one used for prokaryotic fs-genes allowed us to group 12,103 fs-genes into 4,087 clusters. Known programmed frameshift genes were among the obtained clusters. Several clusters may correspond to new examples of dual coding genes. We developed a web interface to browse a database containing all the fs-genes predicted by GeneTack in prokaryotic genomes and eukaryotic mRNA sequences. The fs-genes can be retrieved by similarity search to a given query sequence, by fs-gene cluster browsing, etc. Clusters of fs-genes are characterized with respect to their likely origin, such as pseudogenization, phase variation, programmed frameshifts, etc. All the tools and the database of fs-genes are available at the GeneTack web site http://topaz.gatech.edu/GeneTack/
    PhD. Committee Chair: Borodovsky, Mark; Committee Member: Baranov, Pavel; Committee Member: Hammer, Brian; Committee Member: Jordan, King; Committee Member: Konstantinidis, Kostas; Committee Member: Song, L