Computational Detection of CpG Islands in DNA

Abstract

Regions of DNA rich in CpG dinucleotides, known as CpG islands, are often located upstream of the transcription start site of both tissue-specific and housekeeping genes. Overall, CpG dinucleotides are observed at a density of 25% of the level expected from base composition alone, partly because of 5-methylcytosine decay (Bird, 1993). Because CpG dinucleotides typically occur at low frequency, CpG islands can be distinguished statistically from the rest of the genome. Our method for detecting CpG islands is a heuristic algorithm that employs classic changepoint methods and log-likelihood statistics. A Java applet has been created to allow user interaction and visualization of the segmentation produced by the changepoint analysis. The model is tested on several sequences obtainable from GenBank (NCBI, 1997), including a 220 kb fragment of the human X chromosome, extending from the filamin (FLM) gene to the glucose-6-phosphate dehydrogenase (G6PD) gene, which has been studied experimentally (Rivella et al., 1995; Chen et al., 1996). Preliminary results suggest a breakpoint segmentation that is consistent with manual analysis. About 56% of human genes have associated CpG-rich islands (Antequera and Bird, 1993). By identifying CpG islands, it is thought that regions of DNA coding for housekeeping or tissue-specific genes can be located (Antequera and Bird, 1993), even in the absence of transcriptional activity. Biological experiments searching for such genes can then be narrowed using the locations of the CpG islands.
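
To make the two statistical ideas in the abstract concrete, the following is a minimal sketch and not the heuristic algorithm of this work. It assumes (1) the commonly used observed/expected CpG ratio, count(CG) x N / (count(C) x count(G)), and (2) a single-changepoint scan that treats each position as a Bernoulli indicator of a CpG start and picks the split maximizing a log-likelihood ratio statistic. The function names cpg_obs_exp and best_changepoint are hypothetical and chosen only for illustration.

```python
import math


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: count(CG) * N / (count(C) * count(G))."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * n / (c * g)


def _bernoulli_loglik(k, n):
    """Maximized Bernoulli log-likelihood for k successes in n trials."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def best_changepoint(seq):
    """Scan all single splits of the CpG indicator sequence.

    Returns (position, statistic), where statistic is twice the gain in
    log-likelihood from fitting separate CpG rates on each side of the
    split instead of one rate for the whole sequence (an assumed model).
    """
    seq = seq.upper()
    # x[i] = 1 if a CpG dinucleotide starts at position i, else 0
    x = [1 if seq[i:i + 2] == "CG" else 0 for i in range(len(seq) - 1)]
    n, total = len(x), sum(x)
    null = _bernoulli_loglik(total, n)
    best_pos, best_stat, left = None, 0.0, 0
    for t in range(1, n):
        left += x[t - 1]  # CpG starts among the first t positions
        stat = 2 * (_bernoulli_loglik(left, t)
                    + _bernoulli_loglik(total - left, n - t)
                    - null)
        if stat > best_stat:
            best_pos, best_stat = t, stat
    return best_pos, best_stat


if __name__ == "__main__":
    # Toy sequence: CpG-free stretch followed by a CpG-dense stretch
    toy = "AT" * 50 + "CG" * 50
    print(cpg_obs_exp(toy))
    print(best_changepoint(toy))
```

A full segmentation would apply such a scan recursively to each resulting segment and stop when the statistic falls below a chosen threshold; the stopping rule and threshold are left open here because the abstract does not specify them.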
