160 research outputs found

    An Unusual 500,000 Bases Long Oscillation of Guanine and Cytosine Content in Human Chromosome 21

    An oscillation with a period of around 500 kb in guanine and cytosine content (GC%) is observed in the DNA sequence of human chromosome 21. This oscillation is localized in the rightmost one-eighth region of the chromosome, from 43.5 Mb to 46.5 Mb. Five cycles of oscillation are observed in this region, with six GC-rich peaks and five GC-poor valleys. The GC-poor valleys comprise regions with low density of CpG islands and, alternating between the two DNA strands, low gene density regions. Consequently, the long-range oscillation of GC% results in spacing patterns of both CpG island density and, to a lesser extent, gene density. Comment: 15 pages (figures included), 5 figures
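A long-range GC% profile of the kind described in this abstract is usually obtained by counting G and C bases in windows along the sequence. The following is a minimal sketch of that idea, not the authors' procedure; the function name, window size and step are hypothetical choices for illustration.

```python
def gc_percent_windows(seq, window, step):
    """Return the GC percentage of each window of `window` bases.

    Windows start at 0, step, 2*step, ... and only full windows are kept.
    The name and parameters are illustrative, not taken from the paper.
    """
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        values.append(100.0 * gc / window)
    return values


# Toy example: a GC-only block followed by an AT-only block.
profile = gc_percent_windows("GGCCAATT", window=4, step=4)
```

On a real chromosome the window would be far larger (tens of kilobases) so that a 500 kb oscillation shows up as alternating peaks and valleys in the returned profile.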

    Hierarchical structure of cascade of primary and secondary periodicities in Fourier power spectrum of alphoid higher order repeats

    Background: Identification of approximate tandem repeats is an important task of broad significance and remains a challenging problem in computational genomics. Often there is no single best approach to periodicity detection, and a combination of different methods may improve the prediction accuracy. The discrete Fourier transform (DFT) has been extensively used to study primary periodicities in DNA sequences. Here we investigate the application of the DFT method to identify and study alphoid higher order repeats.
    Results: We used a method based on DFT, with mapping of the symbolic sequence into a numerical sequence, to identify and study alphoid higher order repeats (HORs). For HORs the power spectrum shows an equidistant frequency pattern, with a characteristic two-level hierarchical organization as the signature of a HOR. Our case study was the 16mer HOR tandem in AC017075.8 from human chromosome 7. A very long array of equidistant peaks at multiple frequencies (more than a thousand higher harmonics) is based on the fundamental frequency of the 16mer HOR. A pronounced subset of equidistant peaks is based on multiples of the fundamental HOR frequency (multiplication factor n for an nmer) and higher harmonics. In general, an nmer HOR pattern contains equidistant secondary periodicity peaks, with a pronounced subset of equidistant primary periodicity peaks. This hierarchical pattern as a signature for HOR detection is robust with respect to monomer insertions and deletions, random sequence insertions, etc. For a monomeric alphoid sequence only primary periodicity peaks are present. The 1/f^β noise and the periodicity-three pattern are missing from power spectra in alphoid regions, in accordance with expectations.
    Conclusion: DFT provides a robust detection method for higher order periodicity. An easily recognizable HOR power spectrum is characterized by a hierarchical two-level equidistant pattern: higher harmonics of the fundamental HOR frequency (secondary periodicity) and a subset of pronounced peaks corresponding to constituent monomers (primary periodicity). The number of lower frequency peaks (secondary periodicity) below the frequency of the first primary periodicity peak reveals the size of the nmer HOR, i.e., the number n of monomers contained in the consensus HOR.
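The abstract does not say which symbolic-to-numerical mapping was used. As a rough sketch, the code below assumes the common binary indicator mapping (one 0/1 sequence per base) and sums the DFT power of the four indicator sequences. The function names are hypothetical, and the direct O(N²) sum is only suitable for short toy sequences.

```python
import cmath


def power_spectrum(seq):
    """Total DFT power over the four base indicator sequences.

    Returns a list S where S[k] is the power at frequency k/N for
    k = 0 .. N//2. The indicator mapping is an assumption, not
    necessarily the mapping used in the paper.
    """
    seq = seq.upper()
    n = len(seq)
    spectrum = []
    for k in range(n // 2 + 1):
        total = 0.0
        for base in "ACGT":
            # DFT coefficient of the 0/1 indicator sequence for this base.
            coeff = sum(
                cmath.exp(-2j * cmath.pi * k * pos / n)
                for pos, b in enumerate(seq)
                if b == base
            )
            total += abs(coeff) ** 2
        spectrum.append(total)
    return spectrum


def dominant_period(seq):
    """Period N/k of the strongest peak among frequencies k >= 1."""
    spec = power_spectrum(seq)
    k = max(range(1, len(spec)), key=lambda i: spec[i])
    return len(seq) / k


# Toy tandem repeat: eight exact copies of a 5-base unit.
period = dominant_period("ACGTT" * 8)
```

For an exact tandem repeat the power concentrates at the fundamental frequency and its harmonics, which is the equidistant peak pattern the abstract describes. Real alphoid arrays would need an FFT-based implementation and tolerance for mutations.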

    Quantumlike Chaos in the Frequency Distributions of the Bases A, C, G, T in Drosophila DNA

    Continuous periodogram power spectral analyses of fractal fluctuations of frequency distributions of bases A, C, G, T in Drosophila DNA show that the power spectra follow the universal inverse power-law form of the statistical normal distribution. Inverse power-law form for power spectra of space-time fluctuations is generic to dynamical systems in nature and is identified as self-organized criticality. The author has developed a general systems theory, which provides universal quantification for observed self-organized criticality in terms of the statistical normal distribution. The long-range correlations intrinsic to self-organized criticality in macro-scale dynamical systems are a signature of quantumlike chaos. The fractal fluctuations self-organize to form an overall logarithmic spiral trajectory with the quasiperiodic Penrose tiling pattern for the internal structure. Power spectral analysis resolves such a spiral trajectory as an eddy continuum with embedded dominant wavebands. The dominant peak periodicities are functions of the golden mean. The observed fractal frequency distributions of the Drosophila DNA base sequences exhibit quasicrystalline structure with long-range spatial correlations or self-organized criticality. Modification of the DNA base sequence structure at any location may have significant noticeable effects on the function of the DNA molecule as a whole. The presence of non-coding introns may not be redundant, but serve to organize the effective functioning of the coding exons in the DNA molecule as a complete unit. Comment: 46 pages, 9 figures

    Statistical methods for detecting periodic fragments in DNA sequence data

    Background: Period-10 dinucleotides are structurally and functionally validated factors that influence the ability of DNA to form nucleosomes, histone core octamers. Robust identification of periodic signals in DNA sequences is therefore required to understand nucleosome organisation in genomes. While various techniques for identifying periodic components in genomic sequences have been proposed or adopted, the requirements for such techniques have not been considered in detail and confirmatory testing for a priori specified periods has not been developed.
    Results: We compared the estimation accuracy and suitability for confirmatory testing of autocorrelation, discrete Fourier transform (DFT), integer period discrete Fourier transform (IPDFT) and a previously proposed Hybrid measure. A number of different statistical significance procedures were evaluated, but a blockwise bootstrap proved superior. When applied to synthetic data whose period-10 signal had been eroded, or for which the signal was approximately period-10, the Hybrid technique exhibited superior properties during exploratory period estimation. In contrast, confirmatory testing using the blockwise bootstrap procedure identified IPDFT as having the greatest statistical power. These properties were validated on yeast sequences defined from a ChIP-chip study, where the Hybrid metric confirmed the expected dominance of period-10 in nucleosome-associated DNA but IPDFT identified more significant occurrences of period-10. Application to the whole genomes of yeast and mouse identified ~21% and ~19% respectively of these genomes as spanned by period-10 nucleosome positioning sequences (NPS).
    Conclusions: For estimating the dominant period, we find the Hybrid period estimation method empirically to be the most effective for both eroded and approximate periodicity. The blockwise bootstrap was found to be effective as a significance measure, performing particularly well in the problem of period detection in the presence of eroded periodicity. The autocorrelation method was identified as poorly suited for use with the blockwise bootstrap. Application of our methods to the genomes of two model organisms revealed that a striking proportion of the yeast and mouse genomes is spanned by NPS. Despite their markedly different sizes, roughly equivalent proportions (19-21%) of the genomes lie within period-10 spans of the NPS dinucleotides {AA, TT, TA}. The biological significance of these regions remains to be demonstrated. To facilitate this, the genomic coordinates are available as Additional files 1, 2, and 3 in a format suitable for visualisation as tracks on popular genome browsers.
    Reviewers: This article was reviewed by Prof Tomas Radivoyevitch, Dr Vsevolod Makeev (nominated by Dr Mikhail Gelfand), and Dr Rob D Knight.
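To make the period-10 measures in this abstract more concrete, here is a small sketch that marks the NPS dinucleotides {AA, TT, TA} along a sequence and then scores a candidate period in two ways: a lagged co-occurrence rate (a simple autocorrelation-style statistic) and the Fourier power evaluated at an integer period. The second function is written in the spirit of an integer-period DFT but is an assumption, not the paper's exact IPDFT definition. All function names are hypothetical, and no bootstrap significance test is included.

```python
import cmath

NPS_DINUCLEOTIDES = {"AA", "TT", "TA"}  # dinucleotide set named in the abstract


def dinucleotide_indicator(seq, motifs=NPS_DINUCLEOTIDES):
    """Return a 0/1 list marking positions where a motif dinucleotide starts."""
    seq = seq.upper()
    return [1 if seq[i:i + 2] in motifs else 0 for i in range(len(seq) - 1)]


def lagged_cooccurrence(signal, lag):
    """Fraction of position pairs (i, i + lag) where both entries are 1."""
    pairs = len(signal) - lag
    if pairs <= 0:
        return 0.0
    hits = sum(signal[i] * signal[i + lag] for i in range(pairs))
    return hits / pairs


def power_at_period(signal, period):
    """Squared magnitude of the Fourier sum at frequency 1/period."""
    coeff = sum(
        x * cmath.exp(-2j * cmath.pi * i / period)
        for i, x in enumerate(signal)
    )
    return abs(coeff) ** 2


# Toy sequence: an AA dinucleotide every 10 bases, GC filler in between.
toy = dinucleotide_indicator(("AA" + "GCGCGCGC") * 6)
score_10 = lagged_cooccurrence(toy, 10)
score_7 = lagged_cooccurrence(toy, 7)
```

On the toy sequence both statistics favour period 10 over period 7. Deciding whether such a score is significant on real data is the role of the blockwise bootstrap discussed in the abstract, which this sketch leaves out.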

    Human Promoter Prediction Using DNA Numerical Representation

    With the emergence of genomic signal processing, numerical representation techniques for the DNA alphabet set {A, G, C, T} play a key role in applying digital signal processing and machine learning techniques to the processing and analysis of DNA sequences. The choice of the numerical representation of a DNA sequence affects how well the biological properties can be reflected in the numerical domain for the detection and identification of the characteristics of special regions of interest within the DNA sequence. This dissertation presents a comprehensive study of various DNA numerical and graphical representation methods and their applications in processing and analyzing long DNA sequences. Discussions of the relative merits and demerits of the various methods, experimental results and possible future developments are also included. Another focus of the research is promoter prediction in human (Homo sapiens) DNA sequences with a neural network based multi-classifier system using DNA numerical representation methods. In spite of the recent development of several computational methods for human promoter prediction, there is a need for performance improvement. In particular, the high false positive rate of the feature-based approaches decreases the prediction reliability and leads to erroneous results in gene annotation. To improve the prediction accuracy and reliability, DigiPromPred, a numerical representation based promoter prediction system, is proposed to characterize DNA alphabets in different regions of a DNA sequence. The DigiPromPred system is found to be able to predict promoters with a sensitivity of 90.8% while reducing the false prediction rate for non-promoter sequences with a specificity of 90.4%. The comparative study with state-of-the-art promoter prediction systems for human chromosome 22 shows that the proposed system maintains a good balance between prediction accuracy and reliability. To reduce the system architecture and computational complexity compared to the existing system, a simple feed-forward neural network classifier known as SDigiPromPred is proposed. The SDigiPromPred system is found to be able to predict promoters with a sensitivity of 87%, 87%, 99% while reducing the false prediction rate for non-promoter sequences with a specificity of 92%, 94%, 99% for Human, Drosophila, and Arabidopsis sequences respectively, with reconfigurable capability compared to the existing system

    Periodicity detection and its application in lifelog data

    Wearable sensors are attracting attention in both industry and the consumer market. We can now acquire sensor data from different types of health tracking devices such as smart watches, smart bands and lifelog cameras, and most smart phones are capable of tracking and logging information using built-in sensors. As data is generated and collected from various sources constantly, researchers have focused on interpreting and understanding the semantics of this longitudinal multi-modal data. One challenge is the fusion of multi-modal data and achieving good performance on tasks such as activity recognition, event detection and event segmentation. The classical approach to processing the data generated by wearable sensors has three main parts: 1) event segmentation, 2) event recognition, 3) event retrieval. Many papers have been published in each of the three fields. This thesis focuses on the longitudinal aspect of the data from wearable sensors, instead of concentrating on the data over a short period of time. The key research questions in the thesis are the following. Does longitudinal sensor data have unique features that can distinguish the subject generating the data from other subjects? In other words, from the longitudinal perspective, does the data from different subjects share so much common structure, similarity or identical patterns that it is difficult to identify a subject using the data? If this is the case, what are those common patterns? If we are able to eliminate those similarities among all the data, does the data show more specific features that we can use to model the data series and predict future values? If there are repeating patterns in longitudinal data, we can use different methods to compute the periodicity of the recurring patterns and furthermore to identify and extract those patterns. Following that, we can compare local data over a short time period with more global patterns in order to show the regularity of the local data. Some case studies are included in the thesis to show the value of longitudinal lifelog data in relation to health conditions and training performance