Compressing DNA sequence databases with coil
Background: Publicly available DNA sequence databases such as GenBank are large, and are
growing at an exponential rate. The sheer volume of data being dealt with presents serious storage
and data communications problems. Currently, sequence data is usually kept in large "flat files,"
which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which
rarely achieves good compression ratios. While much research has been done on compressing
individual DNA sequences, surprisingly little has focused on the compression of entire databases
of such sequences. In this study we introduce the sequence database compression software coil.
Results: We have designed and implemented a portable software package, coil, for compressing
and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared
towards achieving high compression ratios at the expense of execution time and memory usage
during compression – the compression time represents a "one-off investment" whose cost is
quickly amortised if the resulting compressed file is transmitted many times. Decompression
requires little memory and is extremely fast. We demonstrate a 5% improvement in compression
ratio over state-of-the-art general-purpose compression tools for a large GenBank database file
containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental
additions to a sequence database.
Conclusion: coil presents a compelling alternative to conventional compression of flat files for the
storage and distribution of DNA sequence databases having a narrow distribution of sequence
lengths, such as EST data. Increasing compression levels for databases having a wide distribution of
sequence lengths is a direction for future work.
Multigene phylogeny and mating tests reveal three cryptic species related to Calonectria pauciramosa
Calonectria pauciramosa is a pathogen of numerous plant hosts worldwide. Recent studies have indicated that it includes cryptic species, some of which are identified in this study. Isolates from various geographical origins were collected and compared based on morphology, DNA sequence data of the β-tubulin, histone H3 and translation elongation factor-1 regions, and mating compatibility. Comparisons of the DNA sequence data and mating compatibility revealed three new species: Ca. colombiana sp. nov. from Colombia, Ca. polizzii sp. nov. from Italy and Ca. zuluensis sp. nov. from South Africa, all of which had distinguishing morphological features. Based on DNA sequence data, Ca. brasiliensis is also elevated to species level.
DNA Sequence Evolution through Integral Value Transformations
Cellular Automata (CA) play a significant role in deciphering the structure, evolution and function of DNA. DNA can be thought of as a one-dimensional multi-state CA, more precisely a four-state CA whose states A, T, C and G can be taken as the numerals 0, 1, 2 and 3. Earlier, G.Ch. Sirakoulis et al. reported on DNA structure, evolution and function using a quaternary-logic one-dimensional CA, and obtained simulation results of DNA evolution with only four linear CA rules. However, our research team found that the DNA sequences produced by these CA evolutions do not exist in the established databases of various genomes, even though the initial seed (the initial global state of the CA) was taken from the database. This problem motivated us to study DNA evolution from a more fundamental point of view. Parallel to the CA paradigm, we have devised an enriched set of discrete transformations, which we call Integral Value Transformations (IVT). Interestingly, by applying the IVT systematically, we have been able to show that each of the DNA sequences at the various discrete time instances of an IVT evolution can be directly mapped to a specific DNA sequence existing in the database. This has been possible by deriving quantitative mathematical parameters of the DNA sequences involving fractals. Thus we have at our disposal a transformational mechanism from one DNA sequence to another.
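The encoding described in this abstract (A, T, C, G as 0, 1, 2, 3) can be illustrated with a minimal sketch of a four-state one-dimensional CA. The abstract does not state the rules used by Sirakoulis et al. or the definition of an IVT, so the update rule below (sum of the three-cell neighbourhood modulo 4, with periodic boundaries) and the function names are assumptions chosen only for illustration.

```python
# Encoding taken from the abstract: A, T, C, G -> 0, 1, 2, 3.
BASE_TO_STATE = {"A": 0, "T": 1, "C": 2, "G": 3}
STATE_TO_BASE = {state: base for base, state in BASE_TO_STATE.items()}


def encode(seq):
    """Turn a DNA string into a list of CA cell states (0..3)."""
    return [BASE_TO_STATE[base] for base in seq.upper()]


def decode(states):
    """Turn a list of CA cell states back into a DNA string."""
    return "".join(STATE_TO_BASE[s] for s in states)


def ca_step(states):
    """One synchronous update of a HYPOTHETICAL linear rule.

    Each cell becomes (left + self + right) mod 4, with periodic
    boundaries. This is not one of the rules from the cited work.
    """
    n = len(states)
    return [
        (states[(i - 1) % n] + states[i] + states[(i + 1) % n]) % 4
        for i in range(n)
    ]


def evolve(seq, steps):
    """Return the seed followed by the sequences after each CA step."""
    states = encode(seq)
    history = [seq.upper()]
    for _ in range(steps):
        states = ca_step(states)
        history.append(decode(states))
    return history
```

Calling `evolve(seed, steps)` on a seed taken from a database returns the evolved sequences, which could then be looked up in that database, the check the abstract reports as failing for CA-generated sequences.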
Plasmodium falciparum has rare correlation properties
A plot of the correlation function of a given DNA sequence has certain characteristic features common to almost all organisms. One common feature is that the correlation values at distances that are multiples of three are higher than the correlation values at other distances. Because of this, such a correlation plot can be divided into two or three curves with different scalings. P. falciparum has a rare correlation property which is probably unique. I have analyzed the genomes of many bacteria, fungi and protozoa and found that P. falciparum is the only organism whose DNA sequence correlation plot can be divided into four curves with different scalings. This property is shared neither by other species of the Plasmodium genus nor by other AT-rich genomes. This could be a hint that the DNA sequence of P. falciparum has undergone certain rare mutational events.
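The period-three feature can be illustrated with a small sketch. The abstract does not define its correlation function, so the measure below (the fraction of positions whose base equals the base k positions later) is an assumption, and the function names are hypothetical.

```python
def match_correlation(seq, k):
    """Fraction of positions i where seq[i] == seq[i + k].

    An ASSUMED, simplified correlation measure; the abstract does not
    say which definition it uses.
    """
    if k <= 0 or k >= len(seq):
        raise ValueError("k must satisfy 0 < k < len(seq)")
    pairs = len(seq) - k
    matches = sum(1 for i in range(pairs) if seq[i] == seq[i + k])
    return matches / pairs


def correlation_profile(seq, max_k):
    """Correlation values for distances 1..max_k, ready for plotting."""
    return [match_correlation(seq, k) for k in range(1, max_k + 1)]
```

For a perfectly periodic toy sequence such as `"ACG" * 10`, the profile is 1.0 at distances 3 and 6 and 0.0 elsewhere. Splitting a real profile by `k % 3` gives the separate curves the abstract refers to.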

A novel mathematical tool for generating highly conserved protein domain via different organismal genomic landscapes
Darwinian evolution hypothesizes that a short stretch of DNA was first constructed and then expanded to give rise to a long strand. This long strand then produced a mix of exons, introns and repetitive DNA sequences. The order in which these three kinds of DNA sequence were produced is unknown. Reshuffling of such stretches of DNA within organisms has given rise to different chromosomes. To date it is not known how this process is governed. In this paper we show that, starting with a sixteen base-pair human olfactory DNA sequence, one can form a highly conserved protein domain. Once this domain is formed, repetitive DNA sequences of a particular kind start to be generated, which signifies that this particular conserved protein domain will be unique in nature. The entire mathematical exercise presented in this paper is based on the simplest possible context-free L-system, which we think has been adopted by biological systems in general.
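As a rough illustration of what a context-free L-system does, the sketch below rewrites every symbol of a string in parallel, independently of its neighbours. The abstract does not give the paper's production rules or its sixteen base-pair seed, so the rules and axiom used here are hypothetical.

```python
def lsystem_step(seq, rules):
    """Rewrite every symbol in parallel; symbols without a rule are kept."""
    return "".join(rules.get(symbol, symbol) for symbol in seq)


def lsystem_generate(axiom, rules, steps):
    """Apply the production rules `steps` times, starting from `axiom`."""
    seq = axiom
    for _ in range(steps):
        seq = lsystem_step(seq, rules)
    return seq


# HYPOTHETICAL rules over the DNA alphabet, for illustration only.
EXAMPLE_RULES = {"A": "AT", "T": "A"}
```

With these example rules, `lsystem_generate("A", EXAMPLE_RULES, 3)` goes through `AT`, `ATA` and `ATAAT`, showing how a short seed grows into a longer, repetitive strand.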