    Encoding DNA sequences by integer chaos game representation

    DNA sequences are fundamental for encoding genetic information. The genetic information may be understood not only from the symbolic sequences but also from the hidden signals inside them. The symbolic sequences need to be transformed into numerical sequences so that the hidden signals can be revealed by signal processing techniques. All current transformation methods encode DNA sequences into numerical values of the same length. These representations have limitations in applications such as genomic signal compression, encryption, and steganography. We propose an integer chaos game representation (iCGR) of DNA sequences and a lossless encoding method for DNA sequences based on the iCGR. In the iCGR method, a DNA sequence is represented by the iterated function of the nucleotides and their positions in the sequence. The DNA sequence can then be uniquely encoded and recovered using three integers from the iCGR. One integer is the sequence length and the other two integers represent the accumulated distributions of nucleotides in the sequence. The integer encoding scheme can compress a DNA sequence by 2 bits per nucleotide. The integer representation of DNA sequences provides a prospective tool for sequence compression, encryption, and steganography. The Python programs in this study are freely available to the public at https://github.com/cyinbox/iCG
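
The three-integer encoding described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' released code: the mapping of A, T, C, G to sign pairs and the power-of-two weight per position are assumptions chosen so that decoding is unambiguous, and the names `icgr_encode` and `icgr_decode` are hypothetical.

```python
# Assumed corner assignment: each nucleotide is a pair of signs (x, y).
CORNERS = {"A": (1, 1), "T": (-1, 1), "C": (-1, -1), "G": (1, -1)}
SIGNS_TO_BASE = {signs: base for base, signs in CORNERS.items()}


def icgr_encode(seq):
    """Encode a DNA string as (length, x, y) using integer arithmetic only."""
    x = y = 0
    for i, base in enumerate(seq):
        ax, ay = CORNERS[base]
        weight = 1 << i          # 2**i, the weight of position i
        x += ax * weight
        y += ay * weight
    return len(seq), x, y


def icgr_decode(n, x, y):
    """Recover the DNA string from (length, x, y).

    The weight of the last position exceeds the sum of all earlier
    weights, so the signs of x and y identify the last nucleotide.
    Removing its contribution and repeating recovers the sequence.
    """
    bases = []
    for i in range(n - 1, -1, -1):
        ax = 1 if x > 0 else -1
        ay = 1 if y > 0 else -1
        bases.append(SIGNS_TO_BASE[(ax, ay)])
        weight = 1 << i
        x -= ax * weight
        y -= ay * weight
    return "".join(reversed(bases))


if __name__ == "__main__":
    code = icgr_encode("ACGT")
    print(code, icgr_decode(*code))
```

Python integers have arbitrary precision, so under these assumptions the round trip stays exact for sequences of any length.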

    Entropy concepts and DNA investigations

    Topological and metric entropies of DNA sequences from different organisms were calculated. The results were compared with each other and with those of corresponding artificial sequences. For all DNA sequences considered there is a maximum of heterogeneity, which falls in the block length interval [5,7]. The maximum distinction between natural and artificial sequences is shifted 1-3 positions to the right of the maximum of heterogeneity, for both metric and topological entropy. This points to the specificity of real DNA sequences in this interval. Comment: 10 pages 7 figures submitted to PL
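
A block-entropy calculation of the kind this abstract refers to might look like the sketch below. The abstract gives no formulas, so the definitions here are assumptions: topological entropy is taken as the logarithm of the number of distinct blocks of length k divided by k, and metric entropy as the Shannon entropy of the block frequencies divided by k, both with the alphabet size as the logarithm base. The function name `block_entropies` is hypothetical.

```python
from collections import Counter
from math import log


def block_entropies(seq, k, alphabet_size=4):
    """Return (topological, metric) entropy per symbol for blocks of length k.

    Assumed definitions (not taken from the paper):
      topological = log(number of distinct k-blocks) / k
      metric      = Shannon entropy of k-block frequencies / k
    Logarithms use base `alphabet_size`, so both values lie in [0, 1].
    """
    if k < 1 or len(seq) < k:
        raise ValueError("need 1 <= k <= len(seq)")
    # Overlapping blocks, one starting at every position.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    topological = log(len(counts), alphabet_size) / k
    metric = -sum(
        (c / total) * log(c / total, alphabet_size) for c in counts.values()
    ) / k
    return topological, metric


if __name__ == "__main__":
    for k in range(1, 4):
        print(k, block_entropies("ACGTTGCAACGGTACC", k))
```

Scanning k over a range and comparing the values against those of a shuffled sequence is one way to look for the block length at which a natural sequence differs most from an artificial one.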

    Construction of a novel phagemid to produce custom DNA origami scaffolds.

    DNA origami, a method for constructing nanoscale objects, relies on a long single strand of DNA to act as the 'scaffold' to template assembly of numerous short DNA oligonucleotide 'staples'. The ability to generate custom scaffold sequences can greatly benefit DNA origami design processes. Custom scaffold sequences can provide better control of the overall size of the final object and better control of low-level structural details, such as locations of specific base pairs within an object. Filamentous bacteriophages and related phagemids can work well as sources of custom scaffold DNA. However, scaffolds derived from phages require inclusion of multi-kilobase DNA sequences in order to grow in host bacteria, and those sequences cannot be altered or removed. These fixed-sequence regions constrain the design possibilities of DNA origami. Here, we report the construction of a novel phagemid, pScaf, to produce scaffolds that have a custom sequence with a much smaller fixed region of 393 bases. We used pScaf to generate new scaffolds ranging in size from 1512 to 10 080 bases and demonstrated their use in various DNA origami shapes and assemblies. We anticipate our pScaf phagemid will enhance development of the DNA origami method and its future applications

    Mapping the Space of Genomic Signatures

    We propose a computational method to measure and visualize interrelationships among any number of DNA sequences allowing, for example, the examination of hundreds or thousands of complete mitochondrial genomes. An "image distance" is computed for each pair of graphical representations of DNA sequences, and the distances are visualized as a Molecular Distance Map: Each point on the map represents a DNA sequence, and the spatial proximity between any two points reflects the degree of structural similarity between the corresponding sequences. The graphical representation of DNA sequences utilized, Chaos Game Representation (CGR), is genome- and species-specific and can thus act as a genomic signature. Consequently, Molecular Distance Maps could inform species identification, taxonomic classifications and, to a certain extent, evolutionary history. The image distance employed, Structural Dissimilarity Index (DSSIM), implicitly compares the occurrences of oligomers of length up to k (herein k = 9) in DNA sequences. We computed DSSIM distances for more than 5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional Scaling (MDS) to obtain Molecular Distance Maps that visually display the sequence relatedness in various subsets, at different taxonomic levels. This general-purpose method does not require DNA sequence homology and can thus be used to compare similar or vastly different DNA sequences, genomic or computer-generated, of the same or different lengths. We illustrate potential uses of this approach by applying it to several taxonomic subsets: phylum Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia, and order Primates. This analysis of an extensive dataset confirms that the oligomer composition of full mtDNA sequences can be a source of taxonomic information. Comment: 14 pages, 7 figures. arXiv admin note: substantial text overlap with arXiv:1307.375
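
The Chaos Game Representation named in this abstract can be illustrated with a short sketch. It covers only the CGR image, not DSSIM or MDS. The corner assignment (A, C, G, T on the unit square) is a common convention and an assumption here, and the function names `cgr_points` and `cgr_grid` are hypothetical.

```python
# Assumed corner convention on the unit square.
CGR_CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory: each point is the midpoint between
    the previous point and the corner of the current nucleotide."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq:
        cx, cy = CGR_CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def cgr_grid(seq, k):
    """Count CGR points in a 2**k x 2**k grid.

    After at least k steps, the cell a point falls in is determined by
    the last k nucleotides, so each cell counts one k-mer.
    """
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq)[k - 1:]:
        col = min(int(x * size), size - 1)  # clamp guards against rounding to 1.0
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for row in cgr_grid("ACGTTGCAACGGTACC", 2):
        print(row)
```

Two sequences of different lengths yield grids of the same shape for a given k, which is what makes an image-based distance between them possible.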

    Information decomposition of symbolic sequences

    We developed a non-parametric method of Information Decomposition (ID) of the content of any symbolic sequence. The method is based on the calculation of Shannon mutual information between the analyzed sequence and artificial symbolic sequences, and allows latent periodicity to be revealed in any symbolic sequence. We show the stability of the ID method in the case of a large number of random letter changes in an analyzed symbolic sequence. We demonstrate the possibilities of the method by analyzing poems as well as DNA and protein sequences. In DNA and protein sequences we show the existence of many DNA and amino acid sequences with different types and lengths of latent periodicity. The possible origin of latent periodicity for different symbolic sequences is discussed. Comment: 18 pages, 8 figure
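
The mutual-information idea in this abstract can be sketched as below. The construction is an assumption, not the paper's exact procedure: the artificial sequence is taken to be the position index modulo a trial period, and the Shannon mutual information (in nats) between symbol and position-in-period is computed from the joint counts. The function name `periodicity_information` is hypothetical.

```python
from collections import Counter
from math import log


def periodicity_information(seq, period):
    """Mutual information (nats) between each symbol and its position
    modulo `period`. Higher values suggest that symbol usage depends on
    the phase within the period, i.e. a possible latent periodicity."""
    n = len(seq)
    if n == 0 or period < 1:
        raise ValueError("need a non-empty sequence and period >= 1")
    joint = Counter((sym, i % period) for i, sym in enumerate(seq))
    sym_counts = Counter(seq)
    pos_counts = Counter(i % period for i in range(n))
    info = 0.0
    for (sym, pos), c in joint.items():
        # p(sym, pos) * log( p(sym, pos) / (p(sym) * p(pos)) )
        info += (c / n) * log(c * n / (sym_counts[sym] * pos_counts[pos]))
    return info


if __name__ == "__main__":
    seq = "ACGACGACTACGACGAAG"
    for period in range(1, 7):
        print(period, round(periodicity_information(seq, period), 4))
```

Scanning the trial period and plotting the resulting values gives a simple spectrum. Judging whether a peak is significant would need a comparison against shuffled sequences, which this sketch does not attempt.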

    Tomato protoplast DNA transformation: physical linkage and recombination of exogenous DNA sequences

    Tomato protoplasts have been transformed with plasmid DNAs containing a chimeric kanamycin resistance gene and putative tomato origins of replication. A calcium phosphate-DNA mediated transformation procedure was employed in combination with either polyethylene glycol or polyvinyl alcohol. There were no indications that the tomato DNA inserts conferred autonomous replication on the plasmids. Instead, Southern blot hybridization analysis of seven kanamycin resistant calli revealed the presence of at least one kanamycin resistance locus per transformant integrated in the tomato nuclear DNA. Generally one to three truncated plasmid copies were found integrated into the tomato nuclear DNA, often physically linked to each other. For one transformant we have been able to use the bacterial ampicillin resistance marker of the vector plasmid pUC9 to 'rescue' a recombinant plasmid from the tomato genome. Analysis of the foreign sequences included in the rescued plasmid showed that integration had occurred in a non-repetitive DNA region. Calf-thymus DNA, used as a carrier in the transformation procedure, was found to be covalently linked to plasmid DNA sequences in the genomic DNA of one transformant. A model is presented describing the fate of exogenously added DNA during the transformation of a plant cell. The results are discussed in reference to the possibility of isolating DNA sequences responsible for autonomous replication in tomato.

    Investigations into the molecular effects of single nucleotide polymorphism

    Objectives: DNA sequences are very rich in short repeats and their pattern can be altered by point mutations. We wanted to investigate the effect of single nucleotide polymorphism (SNP) on the pattern of short DNA repeats and its biological consequences. Methods: Analysis of the pattern of short DNA repeats of the Thy-1 sequence with and without SNP. Searching for DNA-binding factors in any region of significance. Results: Comparing the pattern of short repeats in the Thy-1 gene sequences of Turkish patients with ataxia telangiectasia (AT) with the 'wild type' sequence from the DNA database, we identified a missing 8-bp repeat element due to an SNP in position 1271 (intron II) in AT-DNA sequences. Only the mutated sequence had the potential for the formation of a stem loop in DNA or pre-mRNA. In super-shift experiments we found that DNA oligomers covering the area of this SNP formed a complex with proteins amongst which we identified the proliferating cell nuclear antigen (PCNA) protein. Conclusion: SNPs have the potential to alter DNA or pre-mRNA conformation. Although no SNP-dependent formation of the DNA-protein complex was evident, future investigations could reveal differential molecular mechanisms of cellular regulation. Copyright (C) 2001 S. Karger AG, Basel

    Google matrix analysis of DNA sequences

    For DNA sequences of various species we construct the Google matrix G of Markov transitions between nearby words composed of several letters. The statistical distribution of the elements of this matrix is shown to be described by a power law with an exponent close to that of outgoing links in such scale-free networks as the World Wide Web (WWW). At the same time the sum of ingoing matrix elements is characterized by an exponent significantly larger than those typical for WWW networks. This results in a slow algebraic decay of the PageRank probability determined by the distribution of ingoing elements. The spectrum of G is characterized by a large gap leading to a rapid relaxation process on the DNA sequence networks. We introduce the PageRank proximity correlator between different species, which determines their statistical similarity from the viewpoint of Markov chains. The properties of other eigenstates of the Google matrix are also discussed. Our results establish scale-free features of DNA sequence networks showing their similarities and distinctions with the WWW and linguistic networks. Comment: latex, 11 fig
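
A Google matrix of word transitions, as described in this abstract, could be built along the following lines. Several details are assumptions rather than the paper's procedure: "nearby" is taken to mean a word of m letters immediately followed by the next m letters, the window slides by one letter, the damping factor defaults to the conventional 0.85, and words with no outgoing transitions are given uniform columns. The names `google_matrix` and `pagerank` are hypothetical.

```python
from itertools import product


def google_matrix(seq, m=1, alpha=0.85):
    """Build a column-stochastic Google matrix over all 4**m DNA words.

    G[j][i] is the probability of moving from word i to word j.
    Sequences containing letters other than A, C, G, T raise KeyError.
    """
    words = ["".join(p) for p in product("ACGT", repeat=m)]
    index = {w: i for i, w in enumerate(words)}
    n = len(words)
    counts = [[0.0] * n for _ in range(n)]
    for pos in range(len(seq) - 2 * m + 1):
        src = index[seq[pos:pos + m]]
        dst = index[seq[pos + m:pos + 2 * m]]
        counts[dst][src] += 1.0
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        col_sum = sum(counts[j][i] for j in range(n))
        for j in range(n):
            # Dangling words (no outgoing transitions) get a uniform column.
            s = counts[j][i] / col_sum if col_sum else 1.0 / n
            G[j][i] = alpha * s + (1.0 - alpha) / n
    return words, G


def pagerank(G, iterations=200):
    """Approximate the PageRank vector by power iteration."""
    n = len(G)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [sum(G[j][i] * p[i] for i in range(n)) for j in range(n)]
    return p


if __name__ == "__main__":
    words, G = google_matrix("ACGTTGCAACGGTACCAATG", m=1)
    for word, prob in sorted(zip(words, pagerank(G)), key=lambda t: -t[1]):
        print(word, round(prob, 4))
```

The dense list-of-lists form is only practical for small m, since the matrix has 4**m rows and columns. A sparse representation would be needed for longer words.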