27 research outputs found

    Mutual information for examining correlations in DNA

    This paper examines two methods for detecting long-range correlations in DNA: a fractal measure and a mutual information technique. We evaluate the performance and implications of these methods in detail. In particular, we explore their use in comparing DNA sequences from a variety of sources. Using software for performing in silico mutations, we also consider evolutionary events leading to long-range correlations and analyse these correlations using the techniques presented. Comparisons are made between these virtual sequences, randomly generated sequences, and real sequences. We also explore correlations in chromosomes from different species. Comment: 8 pages, 3 figures
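
As a rough illustration of the mutual information technique mentioned in this abstract, the sketch below estimates the mutual information between nucleotides separated by a fixed distance k. The function name `mutual_information` and the plug-in frequency estimator are assumptions made for this example; the abstract does not state which estimator the authors use.

```python
import math
from collections import Counter


def mutual_information(seq, k):
    """Plug-in estimate (in bits) of the mutual information between
    symbols that are k positions apart in seq.
    Hypothetical helper for illustration, not the authors' code."""
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    n = len(pairs)
    joint = Counter(pairs)                  # counts of (x_i, x_{i+k})
    left = Counter(a for a, _ in pairs)     # marginal counts of x_i
    right = Counter(b for _, b in pairs)    # marginal counts of x_{i+k}
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        ratio = count * n / (left[a] * right[b])
        mi += p_ab * math.log2(ratio)
    return mi


# A strictly periodic toy sequence is fully predictable at distance 4.
periodic = "ACGT" * 50
print(mutual_information(periodic, 4))
```

For a sequence of independent, uniformly drawn nucleotides the estimate stays close to zero at every distance, so values that remain well above zero at large k are one possible sign of long-range correlation.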

    Encoding DNA sequences by integer chaos game representation

    DNA sequences are fundamental for encoding genetic information. Genetic information can be understood not only from the symbolic sequences but also from the hidden signals inside them. The symbolic sequences need to be transformed into numerical sequences so that the hidden signals can be revealed by signal processing techniques. All current transformation methods encode DNA sequences into numerical sequences of the same length. These representations have limitations in applications such as genomic signal compression, encryption, and steganography. We propose an integer chaos game representation (iCGR) of DNA sequences and a lossless encoding method for DNA sequences based on the iCGR. In the iCGR method, a DNA sequence is represented by an iterated function of the nucleotides and their positions in the sequence. The DNA sequence can then be uniquely encoded and recovered using three integers from the iCGR. One integer is the sequence length and the other two integers represent the accumulated distributions of nucleotides in the sequence. The integer encoding scheme can compress a DNA sequence by 2 bits per nucleotide. The integer representation of DNA sequences provides a prospective tool for sequence compression, encryption, and steganography. The Python programs in this study are freely available to the public at https://github.com/cyinbox/iCG
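
The idea of recovering a sequence from three integers can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the corner assignment in `CORNERS`, the recurrence (each nucleotide at position i adds its corner scaled by 2^i), and the function names `encode` and `decode` are all choices made for this example.

```python
# Assumed corner assignment for the four nucleotides (illustrative only).
CORNERS = {"A": (1, 1), "T": (-1, 1), "C": (-1, -1), "G": (1, -1)}
BASE_AT = {corner: base for base, corner in CORNERS.items()}


def encode(seq):
    """Map a DNA string to (length, x, y) using integer arithmetic only."""
    x = y = 0
    for i, base in enumerate(seq):
        ax, ay = CORNERS[base]
        x += ax * (1 << i)   # the step size doubles at every position
        y += ay * (1 << i)
    return len(seq), x, y


def decode(n, x, y):
    """Invert encode(). The last step (size 2^(n-1)) outweighs the sum of
    all earlier steps, so the signs of x and y identify the last base."""
    bases = []
    for i in range(n - 1, -1, -1):
        ax = 1 if x > 0 else -1
        ay = 1 if y > 0 else -1
        bases.append(BASE_AT[(ax, ay)])
        x -= ax * (1 << i)
        y -= ay * (1 << i)
    return "".join(reversed(bases))


triple = encode("ACGT")
print(triple, decode(*triple))
```

In this sketch each coordinate grows by about one bit per nucleotide, which is consistent with the figure of 2 bits per nucleotide quoted in the abstract.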

    Fourier and spectral envelope analysis of medically important bacterial and fungal sequences

    In this paper, we introduce Fourier and spectral envelope analysis methods for analyzing biomolecular sequences, particularly the DNA sequences of medically important bacteria and fungi, to reveal their frequency properties. Fourier analysis includes mapping character strings into numerical sequences, calculating the spectra of DNA sequences, and setting up and solving an optimization problem in order to construct a powerful predictor of exons along long DNA sequences. Spectral envelope analysis uses the spectral envelope to analyze periodicities in categorical-valued time series and is useful for the scaling of non-numeric sequences. It uses an optimization procedure to improve upon the performance of traditional analysis in distinguishing coding from non-coding regions in DNA sequences. The two approaches greatly facilitate the understanding of the local nature, structure and function of biomolecular sequences. They also provide useful techniques to combine bioinformatics analysis with modern computer power to quickly search for diagnostic patterns within long sequences.

    Analyzing DNA Sequences Using Clustering Algorithm

    Data mining offers promising prospects for DNA sequence analysis through its concepts and techniques. This study applies an exploratory data analysis method to cluster DNA sequences. Feature vectors have been developed to map each DNA sequence to a twelve-dimensional vector. The Lysozyme, Myoglobin and Rhodopsin protein families have been tested in this space. Comparisons among homologous sequences give small distances between their characterization vectors, which are easily distinguishable from those of non-homologous sequences, in experiments with a fixed DNA sequence size that does not exceed the length of the shortest DNA sequence. The main advantage of this work is the simultaneous global comparison of multiple DNA sequences in the genomic space, achieved by directly comparing the distances between the corresponding characteristic vectors. The novelty of this work is that a new DNA sequence need not be compared over the whole length of the other sequences; the comparison uses only a fixed-length portion of each sequence that does not exceed the length of the new DNA sequence. In other words, parts of a DNA sequence can identify its functionality and cluster it with its family members.

    Origin of multiple periodicities in the Fourier power spectra of the Plasmodium falciparum genome

    Background: Fourier transforms and their associated power spectra are used for detecting periodicities and protein-coding genes, and are generally regarded as a well-established technique. Many of the periodicities that have been found with this method are quite well understood, such as the periodicity of 3 nt, which is associated with codon usage. But what is the origin of the peculiar frequency multiples k/21 that were reported for a tiny section of chromosome 2 in P. falciparum? Are these present in other chromosomes and perhaps in related organisms? And how should we interpret fractional periodicities in genomes? Results: We applied the binary indicator power spectrum to all chromosomes of P. falciparum, and found that the frequency overtones k/21 are present only in non-coding sections. We did not find such frequency overtones in any other related genomes. Furthermore, the frequency overtones were identified as artifacts of the way the genome is encoded into a numerical sequence; that is, they are frequency aliases. When a different way of encoding the sequence is chosen, the overtones do not appear. In view of these results, we revisited early applications of this technique to proteins where frequency overtones were reported. Conclusions: Some authors recently hinted at the possibility of mapping artifacts and frequency aliases in power spectra. However, in the case of P. falciparum the frequency aliases are particularly strong and can mask the 1/3 frequency that is used for gene detection. This shows that, although this is a well-known technique with a long history of application to proteins, few researchers seem to be aware of the problems posed by frequency aliases.
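
For readers unfamiliar with the binary indicator power spectrum named in this abstract, here is a minimal sketch of one common formulation: each nucleotide gets a 0/1 indicator sequence, and the squared magnitudes of the four discrete Fourier transforms are summed. The function name `indicator_power_spectrum` and the direct O(N²) transform are assumptions made for illustration; the sketch does not attempt to reproduce the aliasing effect the paper analyses.

```python
import cmath


def indicator_power_spectrum(seq):
    """Return S[k] = sum over bases of |DFT of the base's 0/1 indicator|^2
    for k = 0 .. N//2. Direct O(N^2) evaluation, for short sequences only."""
    n = len(seq)
    spectrum = []
    for k in range(n // 2 + 1):
        total = 0.0
        for base in "ACGT":
            # The indicator is 1 where seq[t] == base, so only those
            # positions contribute to the transform.
            coeff = sum(
                cmath.exp(-2j * cmath.pi * k * t / n)
                for t, s in enumerate(seq)
                if s == base
            )
            total += abs(coeff) ** 2
        spectrum.append(total)
    return spectrum


# Toy example: a perfect 3 nt repeat of length 60.
spectrum = indicator_power_spectrum("ATG" * 20)
peak = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
print(peak, len("ATG" * 20) // 3)
```

In the toy example the largest non-zero-frequency component sits at index N/3, that is, at frequency 1/3, the periodicity the abstract associates with codon usage.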

    Hypercomplex cross-correlation of DNA sequences

    A hypercomplex representation of DNA is proposed to facilitate the comparison of DNA sequences with fuzzy composition. With the hypercomplex number representation, conventional sequence analysis methods, such as dot matrix analysis, dynamic programming, and cross-correlation, have been extended and improved to align DNA sequences with fuzzy composition. The hypercomplex dot matrix analysis provides more control over the desired degree of alignment. A new scoring system has been proposed to accommodate the hypercomplex number representation of DNA and has been integrated with the dynamic programming alignment method. With hypercomplex cross-correlation, the match and mismatch alignment information between two aligned DNA sequences is stored separately in the real part and the imaginary parts of the result, respectively. The mismatch alignment information is very useful for refining consensus-sequence-based motif scanning.

    Localizing triplet periodicity in DNA and cDNA sequences

    Background: The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) due to the fact that coding exons contain a series of three-nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied to each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in non-coding regions. This property has been used to identify the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size affects the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism C. elegans. Results: Using both simulated TP signals and the real C. elegans sequence F56F11 as an example, we demonstrate that (1) the Modified Wavelet Transform (MWT) can define the boundary of a TP region better than the conventional Short Time Fourier Transform (STFT); (2) the scale parameter (a) of the MWT determines the precision of TP boundary localization: bigger values of a give sharper TP boundaries but result in a lower signal-to-noise ratio; (3) RNA splicing sites have weaker TP signals than coding regions; (4) TP signals in coding regions can be destroyed or recovered by frame-shift mutations; (5) 6 bp periodicities in introns and intergenic regions can generate false positive signals, and these can be removed with a 6 bp MWT. Conclusions: MWT can provide more precise TP boundaries than STFT, and the boundaries can be further refined by a bigger-scale MWT. Subtraction of 6 bp periodicity signals reduces the number of false positives. Experimentally introduced frame-shift mutations help recover TP signals that have been lost through possible ancient frame-shifts. More importantly, the TP signal has the potential to be used to detect the splice junctions in fully spliced mRNA sequence.
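
The windowed Fourier approach described in the Background of this abstract (split the sequence into segments and look for power at frequency 1/3) can be sketched as below. This illustrates only the conventional sliding-window baseline, not the authors' Modified Wavelet Transform; the function names, the window of 60 nt, the step of 3 nt, and the normalisation by window length are all assumptions made for this example.

```python
import cmath
import random


def period3_power(segment):
    """Power at frequency 1/3 summed over the four nucleotide indicator
    sequences, divided by the segment length (assumed normalisation)."""
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * t / 3)
            for t, s in enumerate(segment)
            if s == base
        )
        total += abs(coeff) ** 2
    return total / len(segment)


def sliding_period3(seq, window=60, step=3):
    """Return (start, power) for each window position along seq."""
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Synthetic test sequence: random flanks around a perfectly periodic core.
random.seed(1)
flank = lambda n: "".join(random.choices("ACGT", k=n))
seq = flank(120) + "GCA" * 40 + flank(120)
for start, power in sliding_period3(seq)[::10]:
    print(start, round(power, 2))
```

Windows that lie entirely inside the periodic core give a much larger value than windows in the random flanks, while windows that straddle a boundary give intermediate values. That blurring over one window length is the localization limit the abstract refers to.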

    Analyzing Exon Structure with PCA and ICA of Short-Time Fourier Transform

    Abstract We use principal component analysis (PCA

    Accurate discrimination of conserved coding and non-coding regions through multiple indicators of evolutionary dynamics

    Background: The conservation of sequences between related genomes has long been recognised as an indication of functional significance, and recognition of sequence homology is one of the principal approaches used in the annotation of newly sequenced genomes. In the context of recent findings that the number of non-coding transcripts in higher organisms is likely to be much higher than previously imagined, discrimination between conserved coding and non-coding sequences is a topic of considerable interest. Additionally, it is desirable to discriminate between coding and non-coding conserved sequences without recourse to sequence similarity searches of protein databases, as such approaches exclude the identification of novel conserved proteins without characterized homologs and may be influenced by the presence in databases of sequences that are erroneously annotated as coding. Results: Here we present a machine learning-based approach for the discrimination of conserved coding sequences. Our method calculates various statistics related to the evolutionary dynamics of two aligned sequences. These features are considered by a Support Vector Machine, which designates the alignment as coding or non-coding with an associated probability score. Conclusion: We show that our approach is both sensitive and accurate with respect to comparable methods and illustrate several situations in which it may be applied, including the identification of conserved coding regions in genome sequences and the discrimination of coding from non-coding cDNA sequences.