A Novel DNA Sequence Compression Method Based on Chaos Game Representation

Abstract

Unique signature images derived from the Chaos Game Representation (CGR) of bio-sequences have so far been studied mainly for pattern recognition applications. In this paper we pose and answer an interesting question: can we reproduce a bio-sequence losslessly given the coordinates of the final point in its CGR image? We show that this is possible in principle, but that representing the coordinates would require enormous resolution, roughly corresponding to the information content of direct binary coding of the sequence. We go on to show that nucleotide codon triplets can be coded using this method, with 16 codons coded in 4 bits each and the remaining 48 in 6 bits each. Theoretically, up to 11% compression is possible with this method. However, algorithm overheads reduce this to a nominal compression of less than 4% for the human genome and 9% for the bacterial genome. We report results on a subset of standard test sequences and on an independent, wider data set.
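The claim that a sequence can be recovered from the final CGR point, and the resolution this requires, can be illustrated with a minimal sketch. It uses the standard CGR iteration (start at the centre of the unit square and move halfway towards the corner of each nucleotide) with exact rational arithmetic. The corner assignment and the function names below are assumptions made for illustration and are not taken from the paper.

```python
from fractions import Fraction

# Assumed corner assignment (a common convention; the paper's may differ).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_final_point(seq):
    """Return the exact final CGR point of seq, starting from (1/2, 1/2)."""
    x = y = Fraction(1, 2)
    for base in seq:
        cx, cy = CORNERS[base]
        # Move halfway towards the corner of the current nucleotide.
        x = (x + cx) / 2
        y = (y + cy) / 2
    return x, y


def cgr_decode(x, y, length):
    """Recover a sequence of known length from its exact final CGR point."""
    bases = []
    for _ in range(length):
        # The quadrant containing the point identifies the last nucleotide.
        cx = 1 if x > Fraction(1, 2) else 0
        cy = 1 if y > Fraction(1, 2) else 0
        bases.append(next(b for b, c in CORNERS.items() if c == (cx, cy)))
        # Invert the halving step to reach the previous point.
        x = 2 * x - cx
        y = 2 * y - cy
    return "".join(reversed(bases))


seq = "GATTACA"
x, y = cgr_final_point(seq)
print(cgr_decode(x, y, len(seq)) == seq)  # round trip is lossless
print(x.denominator.bit_length())         # grows by one bit per nucleotide
```

In this sketch each coordinate has a denominator of 2^(n+1) for a sequence of n nucleotides, so storing both coordinates exactly costs about 2n bits, which is the same order as direct 2-bit-per-base coding.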
