Compressing DNA sequence databases with coil

Author: Hendy Michael D.
White W. Timothy J.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/05/2008

Background: Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression, an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. Results: We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression; the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. Conclusion: coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.
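
The gzip baseline that the abstract compares against can be measured in a few lines. The sketch below is only an illustration of that baseline, not of coil or edit-tree coding: the FASTA-like records, the `compression_ratio` helper and the `flat_file` variable are hypothetical stand-ins for a real GenBank flat file.

```python
import gzip


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Return original size divided by gzip-compressed size."""
    compressed = gzip.compress(data, compresslevel=level)
    return len(data) / len(compressed)


# Hypothetical flat file: 200 FASTA-like records with a synthetic sequence.
# Real EST data is far less repetitive, so real ratios will be much lower.
records = [f">seq{i}\n{'ACGT' * 25}\n" for i in range(200)]
flat_file = "".join(records).encode("ascii")

ratio = compression_ratio(flat_file)
print(f"gzip compression ratio on synthetic data: {ratio:.1f}")
```

A ratio computed this way on a real database file gives the reference point against which a specialised tool would be judged.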

Massey Research Online

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Multigene phylogeny and mating tests reveal three cryptic species related to Calonectria pauciramosa

Author: Crous P.W.
Lombard L.
Wingfield B.D.
Wingfield M.J.
Publication date: 01/01/2010

Calonectria pauciramosa is a pathogen of numerous plant hosts worldwide. Recent studies have indicated that it includes cryptic species, some of which are identified in this study. Isolates from various geographical origins were collected and compared based on morphology, DNA sequence data of the β-tubulin, histone H3 and translation elongation factor-1 regions, and mating compatibility. Comparisons of the DNA sequence data and mating compatibility revealed three new species. These included Ca. colombiana sp. nov. from Colombia, Ca. polizzii sp. nov. from Italy and Ca. zuluensis sp. nov. from South Africa, all of which had distinguishing morphological features. Based on DNA sequence data, Ca. brasiliensis is also elevated to species level.

Elsevier - Publisher Connector

PubMed Central

Wageningen University & Research Publications

UPSpace at the University of Pretoria

DNA Sequence Evolution through Integral Value Transformations

Author: Arunava Goswami
Pabitra Pal Choudhury
Ranita Guha
Shantanav Chakraborty
Sk Sarif Hassan
Publication date: 01/01/2011

Cellular Automata (CA) play a significant role in deciphering DNA structure, evolution and function. DNA can be thought of as a one-dimensional multi-state CA, more precisely a four-state CA whose states A, T, C and G can be taken as the numerals 0, 1, 2 and 3. Earlier, G.Ch. Sirakoulis et al. reported on DNA structure, evolution and function through a quaternary-logic one-dimensional CA, and obtained simulation results of DNA evolution with only four linear CA rules. Our research team, however, found that the DNA sequences produced through these CA evolutions do not exist in the established databases of various genomes, although the initial seed (the initial global state of the CA) was taken from the database. This problem motivated us to study DNA evolution from a more fundamental point of view. Parallel to the CA paradigm, we have devised an enriched set of discrete transformations named Integral Value Transformations (IVT). Interestingly, on applying the IVT systematically, we have been able to show that each of the DNA sequences at various discrete time instances in IVT evolutions can be directly mapped to a specific DNA sequence existing in the database. This has been possible through our efforts to obtain quantitative mathematical parameters of the DNA sequences involving fractals. Thus we have at our disposal a transformational mechanism from one DNA sequence to another.
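
The encoding described in this abstract (A, T, C, G as 0, 1, 2, 3) and the idea of evolving a sequence as a one-dimensional four-state CA can be sketched as follows. The mapping is the one stated in the abstract; the update rule in `ca_step` (sum of the two neighbours modulo 4, with periodic boundaries) is a hypothetical linear rule chosen for illustration and is not claimed to be one of the rules used by Sirakoulis et al. or an IVT.

```python
# Mapping stated in the abstract: A, T, C, G taken as numerals 0, 1, 2, 3.
BASE_TO_STATE = {"A": 0, "T": 1, "C": 2, "G": 3}
STATE_TO_BASE = {state: base for base, state in BASE_TO_STATE.items()}


def encode(seq: str) -> list[int]:
    """Turn a DNA string into a list of CA cell states."""
    return [BASE_TO_STATE[base] for base in seq.upper()]


def decode(states: list[int]) -> str:
    """Turn a list of CA cell states back into a DNA string."""
    return "".join(STATE_TO_BASE[state] for state in states)


def ca_step(states: list[int]) -> list[int]:
    """One synchronous update with a hypothetical linear rule:
    new cell = (left neighbour + right neighbour) mod 4, periodic boundary."""
    n = len(states)
    return [(states[(i - 1) % n] + states[(i + 1) % n]) % 4 for i in range(n)]


seed = "ATCG"
next_generation = decode(ca_step(encode(seed)))
print(seed, "->", next_generation)
```

Whether a sequence produced this way appears in a genome database is exactly the question the abstract raises; the sketch only shows the mechanics of one evolution step.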

Crossref

Nature Precedings

Plasmodium falciparum has rare correlation properties

Author: Kushal Shah
Publication date: 23/01/2012

A plot of the correlation function of a given DNA sequence has certain characteristic features common to almost all organisms. One common feature is that the correlation values at distances that are multiples of three are higher than the correlation values at other distances. Because of this, such a correlation plot can be divided into two or three curves with different scalings. P. falciparum has a rare correlation property which is probably unique. I have analyzed the genomes of many bacteria, fungi and protozoa and found that P. falciparum is the only organism whose DNA sequence correlation plot can be divided into four curves with different scalings. This property is shared neither by other species of the Plasmodium genus nor by other AT-rich genomes. This could be a hint that the DNA sequence of P. falciparum has undergone certain rare mutational events.
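
A correlation value at distance k can be computed in several ways, and the abstract does not say which definition is used. The sketch below assumes one simple choice: the observed fraction of positions i where the bases at i and i + k are identical, minus the fraction expected from the overall base frequencies. The function name `correlation` and the toy sequence are illustrative assumptions.

```python
from collections import Counter


def correlation(seq: str, k: int) -> float:
    """Excess probability that bases k positions apart are identical.

    Assumed definition: observed match fraction at distance k minus the
    match fraction expected from base frequencies alone (sum of p_b ** 2).
    """
    n = len(seq)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(seq)")
    matches = sum(1 for i in range(n - k) if seq[i] == seq[i + k])
    observed = matches / (n - k)
    expected = sum((count / n) ** 2 for count in Counter(seq).values())
    return observed - expected


# Toy sequence with an exact period of three, so distances that are
# multiples of three stand out from the others.
toy = "ACG" * 20
for k in range(1, 7):
    print(k, round(correlation(toy, k), 3))
```

Plotting such values against k for a real genome, and grouping them by k modulo 3, is one way to see the separate curves the abstract refers to.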

Crossref

Nature Precedings

A novel mathematical tool for generating highly conserved protein domain via different organismal genomic landscapes

Author: Arunava Goswami
Pabitra Pal Choudhury
Rajneesh Singh
Sk. Sarif Hassan
Publication date: 02/09/2010

Darwinian evolution hypothesizes that a short stretch of DNA was first constructed and then expanded to give rise to a long strand. This long strand then produced a mix of exons, introns and repetitive DNA sequences. The order in which these three kinds of DNA sequence were produced is unknown. Reshuffling of such stretches of DNA within organisms has given rise to different chromosomes. To date, it is not known how this process is governed. In this paper we show that, starting with a sixteen-base-pair human olfactory DNA sequence, one can form a highly conserved protein domain. Once this domain is formed, repetitive DNA sequences of a particular kind start to be generated, which signifies that this particular conserved protein domain will be unique in nature. The entire mathematical exercise presented in this paper is based on the simplest possible context-free L-system, which we think has been adopted by biological systems in general.
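
A context-free L-system rewrites every symbol of a string in parallel using a fixed table of production rules. The sketch below shows that mechanism on the DNA alphabet; the rule table `RULES` and the one-letter axiom are hypothetical and are not the productions or the sixteen-base-pair seed used in the paper.

```python
def lsystem(axiom: str, rules: dict[str, str], steps: int) -> str:
    """Apply context-free production rules to every symbol in parallel.

    Symbols without a rule are copied unchanged.
    """
    current = axiom
    for _ in range(steps):
        current = "".join(rules.get(symbol, symbol) for symbol in current)
    return current


# Hypothetical productions over the DNA alphabet (illustration only).
RULES = {"A": "AT", "T": "A"}

for step in range(5):
    print(step, lsystem("A", RULES, step))
```

With these toy rules the string lengths follow the Fibonacci numbers, which illustrates how a short seed can grow into a long strand containing repeated motifs.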

Crossref

Nature Precedings