201 research outputs found

    Database of exact tandem repeats in the Zebrafish genome

    Abstract. Background: Sequencing of the approximately 1.7 billion bases of the zebrafish genome is currently underway. To date, few high resolution genetic maps exist for the zebrafish genome, based mainly on single nucleotide polymorphisms (SNPs) and short microsatellite repeats. The desire to construct a higher resolution genetic map led to the construction of a database of tandemly repeating elements within the zebrafish Zv8 assembly. Description: Exact tandem repeats with a repeat length of at least three bases and a copy number of at least 10 were reported. Repeats with a total length of 250 or fewer bases and their flanking regions were masked for known vertebrate repeats. Optimal primer pairs were computationally designed in the regions flanking the detected repeats. This database of exact tandem repeats can then be used as a resource by molecular biologists interested in experimentally testing VNTRs within a zebrafish population. Conclusions: A total of 116,915 repeats with a base length of at least three nucleotides were detected. The longest of these was a 54-base repeat with fourteen tandem copies. A significant number of repeats with a base length of 18, 24, 27 and 30 were detected, many with potentially novel proline-rich coding regions. Detection of exact tandem repeats in the zebrafish genome yields a wealth of information regarding potential polymorphic sites for VNTRs. The association of many of these repeats with potentially novel yet similar coding regions yields an exciting potential for disease-associated genes. A web interface for querying repeats is available at http://bioinformatics.louisville.edu/zebrafish/. This portal allows users to search for repeats of a selected base size in any valid specified region within the 25 linkage groups.
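The reporting criteria in this abstract (repeat unit of at least three bases, at least 10 exact copies) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the `max_unit` cutoff of 60, and the choice to skip units that are themselves repeats of a shorter string (for example `AAA`) are assumptions made for illustration only.

```python
def is_primitive(unit):
    """True if unit is not itself a repetition of a shorter string."""
    return unit not in (unit + unit)[1:-1]


def find_exact_tandem_repeats(seq, min_unit=3, max_unit=60, min_copies=10):
    """Return (start, unit, copies) for maximal exact tandem repeats.

    min_unit and min_copies follow the thresholds stated in the abstract;
    max_unit is a hypothetical cutoff chosen for this sketch.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for k in range(min_unit, max_unit + 1):
        i = 0
        while i + k < n:
            # Extend j while the sequence stays periodic with period k.
            j = i
            while j + k < n and seq[j] == seq[j + k]:
                j += 1
            span = j - i + k          # length of the period-k region
            copies = span // k        # whole copies of the unit
            unit = seq[i:i + k]
            if copies >= min_copies and is_primitive(unit):
                hits.append((i, unit, copies))
            i = j + 1                 # resume after the mismatch
    return hits


if __name__ == "__main__":
    demo = "GGG" + "ACT" * 12 + "TTTT"
    print(find_exact_tandem_repeats(demo))
```

The scan costs roughly O(n × max_unit) character comparisons, which is fine for a demonstration but would need a faster method for a whole genome.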

    Assembly and Compositional Analysis of Human Genomic DNA - Doctoral Dissertation, August 2002

    In 1990, the United States Human Genome Project was initiated as a fifteen-year endeavor to sequence the approximately three billion bases making up the human genome (Vaughan, 1996). As of December 31, 2001, the public sequencing efforts have sequenced a total of 2.01 billion finished bases, representing 63.0% of the human genome (http://www.ncbi.nlm.nih.gov/genome/seq/page.cgi?F=HsProgress.shtml&&ORG=Hs), to a Bermuda quality error rate of 1/10000 (Smith and Carrano, 1996). In addition, 1.11 billion bases representing 34.8% of the human genome have been sequenced to a rough-draft level. Efforts such as UCSC's GoldenPath (Kent and Haussler, 2001) and NCBI's contig assembly (Jang et al., 1999) attempt to assemble the human genome by incorporating both finished and rough-draft sequence. The availability of the human genome data allows us to ask questions concerning the maintenance of specific regions of the human genome. We consider two hypotheses for maintenance of high G+C regions: the presence of specific repetitive elements and compositional mutation biases. Our results rule out the possibility that the G+C content of repetitive elements determines the regions of high and low G+C content in the human genome. We determine that there is a compositional bias for mutation rates. However, these biases are not responsible for the maintenance of high G+C regions. In addition, we show that regions of the human genome under less selective pressure will mutate towards a higher A+T composition, regardless of the surrounding G+C composition. We also analyze sequence organization and show that previous studies of isochore regions (Bernardi, 1993) cannot be generalized within the human genome. In addition, we propose a method to assemble only those parts of the human genome that are finished into larger contigs. Analysis of the contigs can lead to the mining of meaningful biological data that can give insights into genetic variation and evolution.
I suggest a method to aid in single nucleotide polymorphism (SNP) detection, which can help to determine differences within a population. I also discuss a dynamic-programming based approach to sequence assembly validation and detection of large-scale polymorphisms within a population that is made possible through the availability of large human sequence contigs.

    Pattern Matching Techniques and Their Applications to Computational Molecular Biology - A Review

    Pattern matching techniques have been useful in solving many problems associated with computer science, including data compression (Crochemore and Lecroq, 1996), data encryption (RSA Laboratories, 1993), and computer vision (Grimson and Huttenlocher, 1990). In recent years, developments in molecular biology have led to large scale sequencing of genomic DNA. Since this data is being produced in such rapid fashion, tools to analyze DNA segments are desired. The goal here is to discuss various techniques and tools for solving pattern matching questions in computational biology, including optimal sequence alignment, multiple sequence alignment, and building models to describe sequence families using Hidden Markov Models (HMMs) and regular expressions.

    Sequence Assembly Validation by Restriction Digest Fingerprint Comparison

    DNA sequence analysis depends on the accurate assembly of fragment reads for the determination of a consensus sequence. Genomic sequences frequently contain repeat elements that may confound the fragment assembly process, and errors in fragment assembly may seriously impact the biological interpretation of the sequence data. Validating the fidelity of sequence assembly by experimental means is desirable. This report examines the use of restriction digest analysis as a method for testing the fidelity of sequence assembly. Restriction digest fingerprint matching is an established technology for high resolution physical map construction, but the requirements for assembly validation differ from those of fingerprint mapping. Fingerprint matching is a statistical process that is robust to the presence of errors in the data and independent of absolute fragment mass determination. Assembly validation depends on the recognition of a small number of discrepant fragments and is very sensitive to both false positive and false negative errors in the data. Assembly validation relies on the comparison of absolute masses derived from sequence with masses that are experimentally determined, making absolute accuracy as well as experimental precision important. As the size of a sequencing project increases, the difficulties in assembly validation by restriction fingerprinting become more severe. Simulation studies are used to demonstrate that large-scale errors in sequence assembly can escape detection in fingerprint pattern comparison. Alternative technologies for sequence assembly validation are discussed.
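The comparison this abstract describes, predicted fragments derived from the assembled sequence against experimentally measured ones, can be sketched in Python as follows. This is a simplified illustration and not the report's method: it uses fragment length in base pairs as a stand-in for mass, assumes a linear molecule and a single enzyme, and the function names, the EcoRI site `GAATTC` with a cut one base into the site, and the 5% relative tolerance are all assumptions.

```python
def digest_fragments(seq, site="GAATTC", cut_offset=1):
    """Fragment lengths (in sequence order) from an in-silico digest.

    site and cut_offset default to EcoRI (G^AATTC); the molecule is
    treated as linear. Both defaults are illustrative assumptions.
    """
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def unmatched_fragments(predicted, observed, rel_tol=0.05):
    """Greedily pair predicted with observed sizes within rel_tol.

    Returns (predicted sizes with no partner, observed sizes left over).
    rel_tol is a hypothetical measurement tolerance.
    """
    remaining = sorted(observed)
    missing = []
    for size in sorted(predicted):
        match = next(
            (o for o in remaining if abs(o - size) <= rel_tol * size), None
        )
        if match is None:
            missing.append(size)
        else:
            remaining.remove(match)
    return missing, remaining


if __name__ == "__main__":
    predicted = digest_fragments("AAAGAATTCCCCGAATTCTT")
    print(predicted, unmatched_fragments(predicted, [4, 7, 9]))
```

A non-empty result from `unmatched_fragments` flags discrepant fragments. As the abstract notes, an empty result does not prove the assembly is correct, because a rearrangement can leave the set of fragment sizes unchanged.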

    Compositional Analysis of Homogeneous Regions in Human Genomic DNA

    Due to increased production of human DNA sequence, it is now possible to explore and understand human genomic organization at the sequence level. In particular, we have studied one of the major organizational components of vertebrate genomes, previously described as isochores (Bernardi, 1993), which are compositionally homogeneous DNA segments based on G+C content. We have examined sequence data for the existence of compositionally differing regions and report that while compositionally homogeneous regions are present in the human genome, current isochore classification schemes are too broad for sequence-level data.

    Assembly and Analysis of Extended Human Genomic Contig Regions

    The Human Genome Project (HGP) has led to the deposit of human genomic sequence in the form of sequenced clones into various databases such as the DNA Data Bank of Japan (DDBJ) (Tateno and Gojobori, 1997), the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database (Stoesser et al., 1999), and GenBank (Benson et al., 1998). Many of these sequenced clones occur in regions where sequencing has taken place either within the same sequencing center or other centers throughout the world. The assembly of extended segments of genomic sequence by looking at overlapping end segments is desired and is currently available only in a limited sense from the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/genome/seq/) and Oak Ridge National Laboratories' (ORNL) Genome Channel (http://compbio.ornl.gov/tools/channel/). We attempt to collate a definitive set of nonredundant extended segments of human genomic sequence by taking individual human entries in GenBank greater than 25 kilobases (kb) and extending them on either end. We address the several difficulties that arise when attempting to extend segments.

    Computational Detection of CpG Islands in DNA

    Regions of DNA rich in CpG dinucleotides, also known as CpG islands, are often located upstream of the transcription start site in both tissue-specific and housekeeping genes. Overall, CpG dinucleotides are observed at a density of 25% of the level expected from base composition alone, partially due to 5-methylcytosine decay (Bird, 1993). Since CpG dinucleotides typically occur with low frequency, CpG islands can be distinguished statistically in the genome. Our method of detecting CpG islands involves a heuristic algorithm employing classic changepoint methods and log-likelihood statistics. A Java applet has been created to allow for user interaction and visualization of the segmentation resulting from the changepoint analysis. The model is tested using several sequences obtainable from GenBank (NCBI, 1997), including a 220 kb fragment of human X chromosome from the filamin (FLM) gene to the glucose-6-phosphate dehydrogenase (G6PD) gene which has been experimentally studied (Rivella et al., 1995; E.Y. Chen et al., 1996). Preliminary results suggest a breakpoint segmentation that is consistent with observable manual analysis. About 56% of human genes have associated CpG-rich islands (Antequera and Bird, 1993). By identifying the CpG islands, it is thought that regions of DNA coding for housekeeping or tissue-specific genes can be located (Antequera and Bird, 1993) even in the absence of transcriptional activity. Biological experiments searching for such genes can then be narrowed given the locations of the CpG islands.
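The statement that CpG dinucleotides occur at about 25% of the level expected from base composition can be made concrete with the commonly used observed/expected CpG ratio. The Python sketch below computes only that ratio. It does not implement the changepoint and log-likelihood segmentation described in the abstract, and the function name and the handling of sequences without C or G are assumptions.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio for a DNA sequence.

    expected = (#C * #G) / length, i.e. the CpG count predicted from
    base composition alone; observed = number of 'CG' dinucleotides.
    Returns 0.0 when the sequence has no C or no G (a convention
    chosen for this sketch).
    """
    seq = seq.upper()
    c = seq.count("C")
    g = seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    observed = seq.count("CG")   # 'CG' cannot overlap itself
    expected = c * g / len(seq)
    return observed / expected


if __name__ == "__main__":
    for s in ("CGCGCGCG", "CCGG", "CCCCGGGGATAT"):
        print(s, cpg_obs_exp(s))
```

A ratio near 1 means CpG occurs about as often as base composition predicts, while bulk vertebrate DNA would sit far lower, consistent with the 25% figure above. Island detection would additionally require scanning windows or segmenting the sequence.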