167,027 research outputs found

    Compressing DNA sequence databases with coil

    Background: Publicly available DNA sequence databases such as GenBank are large and are growing at an exponential rate. The sheer volume of data presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression, an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. Results: We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression; the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. Conclusion: coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.
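
    The gzip baseline that the abstract compares against can be measured with a few lines of Python. This is only a sketch of that baseline, not of coil or of edit-tree coding. The synthetic FASTA-like flat file, the record names and the `compression_ratio` helper are assumptions made for illustration.

```python
import gzip
import random


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Return original size divided by gzip-compressed size."""
    compressed = gzip.compress(data, compresslevel=level)
    return len(data) / len(compressed)


# Build a small synthetic flat file (hypothetical data, not GenBank):
# 200 records, each a header line followed by 400 random bases.
random.seed(0)
records = []
for i in range(200):
    seq = "".join(random.choice("ACGT") for _ in range(400))
    records.append(f">seq{i}\n{seq}\n")
flat_file = "".join(records).encode("ascii")

ratio = compression_ratio(flat_file)
print(f"gzip compression ratio: {ratio:.2f}")
```

    Real EST data contains many near-duplicate sequences, so ratios measured on random bases say little about a real database; the sketch only shows how the ratio itself is computed.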

    Multigene phylogeny and mating tests reveal three cryptic species related to Calonectria pauciramosa

    Calonectria pauciramosa is a pathogen of numerous plant hosts worldwide. Recent studies have indicated that it includes cryptic species, some of which are identified in this study. Isolates from various geographical origins were collected and compared based on morphology, DNA sequence data of the β-tubulin, histone H3 and translation elongation factor-1 regions, and mating compatibility. Comparisons of the DNA sequence data and mating compatibility revealed three new species. These included Ca. colombiana sp. nov. from Colombia, Ca. polizzii sp. nov. from Italy and Ca. zuluensis sp. nov. from South Africa, all of which had distinguishing morphological features. Based on DNA sequence data, Ca. brasiliensis is also elevated to species level.

    DNA Sequence Evolution through Integral Value Transformations

    Cellular Automata (CA) play a significant role in deciphering the structure, evolution and function of DNA. DNA can be thought of as a one-dimensional multi-state CA, more precisely a four-state CA whose states A, T, C and G can be taken as the numerals 0, 1, 2 and 3. Earlier, G. Ch. Sirakoulis et al. modelled DNA structure, evolution and function with a quaternary-logic one-dimensional CA and simulated DNA evolution using only four linear CA rules. Our research team found, however, that the DNA sequences produced by these CA evolutions do not exist in the established databases of various genomes, although the initial seed (the initial global state of the CA) was taken from the database. This problem motivated us to study DNA evolution from a more fundamental point of view. In parallel to the CA paradigm, we have devised an enriched set of discrete transformations, named Integral Value Transformations (IVT). Interestingly, by applying the IVT systematically, we have been able to show that each of the DNA sequences at the various discrete time instances of an IVT evolution can be directly mapped to a specific DNA sequence existing in the database. This has been possible through quantitative mathematical parameters of the DNA sequences involving fractals. Thus we have at our disposal a transformational mechanism from one DNA sequence to another.
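
    The encoding described in this abstract (A, T, C, G taken as 0, 1, 2, 3) and one update of a linear quaternary CA can be sketched as follows. The neighbourhood weights and the periodic boundary are hypothetical choices for illustration; they are not the four rules of Sirakoulis et al., and the sketch does not implement the IVT.

```python
# Base-to-numeral mapping as stated in the abstract.
BASE_TO_STATE = {"A": 0, "T": 1, "C": 2, "G": 3}
STATE_TO_BASE = {state: base for base, state in BASE_TO_STATE.items()}


def encode(seq):
    """Convert a DNA string into a list of CA states (0..3)."""
    return [BASE_TO_STATE[base] for base in seq.upper()]


def decode(states):
    """Convert a list of CA states back into a DNA string."""
    return "".join(STATE_TO_BASE[state] for state in states)


def ca_step(states, weights=(1, 0, 1)):
    """One synchronous update of a linear quaternary CA.

    new[i] = (wl*s[i-1] + wc*s[i] + wr*s[i+1]) mod 4, with periodic
    boundaries. The default weights are an assumption, not a published rule.
    """
    n = len(states)
    wl, wc, wr = weights
    return [
        (wl * states[i - 1] + wc * states[i] + wr * states[(i + 1) % n]) % 4
        for i in range(n)
    ]


def evolve(seq, steps, weights=(1, 0, 1)):
    """Return the seed followed by `steps` successive CA generations."""
    states = encode(seq)
    history = [seq.upper()]
    for _ in range(steps):
        states = ca_step(states, weights)
        history.append(decode(states))
    return history


for generation in evolve("ATCG", 3):
    print(generation)
```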

    Plasmodium falciparum has rare correlation properties

    A plot of the correlation function of a given DNA sequence has certain characteristic features common to almost all organisms. One common feature is that the correlation values at distances that are multiples of three are higher than the correlation values at other distances. Because of this, such a correlation plot can be divided into two or three curves with different scalings. P. falciparum has a rare correlation property which is probably unique. I have analyzed the genomes of many bacteria, fungi and protozoa and found that P. falciparum is the only organism whose DNA sequence correlation plot can be divided into four curves with different scalings. This property is shared neither by other species of the Plasmodium genus nor by other AT-rich genomes. This could be a hint that the DNA sequence of P. falciparum has undergone certain rare mutational events.
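
    The abstract does not define its correlation function, so the following is a sketch under one common assumption: the correlation at distance k is the fraction of position pairs k apart that carry the same base, minus the fraction expected from the base frequencies alone. The function name and the toy sequence are hypothetical.

```python
from collections import Counter


def correlation(seq, max_distance):
    """Return {k: correlation at distance k} for k = 1..max_distance.

    Assumed definition: (fraction of pairs k apart with identical bases)
    minus (sum of squared base frequencies).
    """
    seq = seq.upper()
    n = len(seq)
    if max_distance >= n:
        raise ValueError("max_distance must be smaller than the sequence length")
    counts = Counter(seq)
    expected = sum((count / n) ** 2 for count in counts.values())
    values = {}
    for k in range(1, max_distance + 1):
        pairs = n - k
        matches = sum(1 for i in range(pairs) if seq[i] == seq[i + k])
        values[k] = matches / pairs - expected
    return values


# Toy sequence with an exact period of three (not real genomic data).
toy = "ACG" * 50
corr = correlation(toy, 6)
for k, value in corr.items():
    print(k, round(value, 3))
```

    On the toy sequence the values at distances 3 and 6 exceed those at other distances, which mirrors the multiples-of-three feature described above in its most extreme form.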

    A novel mathematical tool for generating highly conserved protein domain via different organismal genomic landscapes

    Darwinian evolution hypothesizes that a short stretch of DNA was first constructed and then expanded to give rise to a long strand. This long strand then produced a mix of exons, introns and repetitive DNA sequences. The order in which these three kinds of DNA sequence were produced is unknown. Reshuffling of such stretches of DNA within organisms has given rise to different chromosomes. To date it is not known how this process is governed. In this paper we show that, starting with a sixteen base-pair human olfactory DNA sequence, one can form a highly conserved protein domain. Once this domain is formed, repetitive DNA sequences of a particular kind start to be generated, which signifies that this particular conserved protein domain will be unique in nature. The entire mathematical exercise presented in this paper is based on the simplest possible context-free L-system, which we think has been adopted by biological systems in general.
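
    A context-free L-system rewrites every symbol of a string in parallel according to a fixed production rule. The sketch below shows that mechanism on the DNA alphabet. The production rules and the single-base axiom are hypothetical, since the abstract gives neither the rules nor the sixteen base-pair seed.

```python
def lsystem(axiom, rules, iterations):
    """Apply context-free production rules to every symbol in parallel.

    Symbols without a rule are copied unchanged.
    """
    current = axiom
    for _ in range(iterations):
        current = "".join(rules.get(symbol, symbol) for symbol in current)
    return current


# Hypothetical production rules, chosen only to illustrate string growth.
RULES = {"A": "AT", "T": "C", "C": "G", "G": "A"}

for step in range(5):
    print(step, lsystem("A", RULES, step))
```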