10 research outputs found
A Survey of Genomic Traces Reveals a Common Sequencing Error, RNA Editing, and DNA Editing
While it is widely held that an organism's genomic information should remain constant, several protein families are known to modify it. Members of the AID/APOBEC protein family can deaminate DNA, and members of the ADAR family can deaminate RNA. Characterizing the scope of these events is challenging. Here we use large genomic data sets, such as the two billion sequences in the NCBI Trace Archive, to look for clusters of mismatches of the same type, which are a hallmark of editing events caused by APOBEC3 and ADAR. We align 603,249,815 traces from the NCBI Trace Archive to their reference genomes. In clusters of mismatches of increasing size, at least one systematic sequencing error (G-to-A) dominates the results. It is still present in mismatches at 99% accuracy and vanishes only at 99.99% accuracy or higher. The error appears to have entered about 1% of the HapMap, possibly affecting other users who rely on this resource. Further investigation, using stringent quality thresholds, uncovers thousands of mismatch clusters with no apparent defects in their chromatograms. These traces provide the first reported candidates of endogenous DNA editing in human, further elucidate RNA editing in human and mouse, and also reveal, for the first time, extensive RNA editing in Xenopus tropicalis. We show that the NCBI Trace Archive provides a valuable resource for the investigation of DNA and RNA editing, and sets the stage for a comprehensive mapping of editing events in large-scale genomic datasets.
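The screening idea above, scanning trace-to-reference alignments for windows dominated by a single mismatch type, can be illustrated with a minimal Python sketch. Everything here (function names, the 100 bp window, the five-site threshold) is an assumption for illustration, not the authors' actual pipeline, which also applies base-quality and chromatogram checks.

```python
from collections import Counter

def mismatches(trace, reference):
    """Yield (position, ref_base, trace_base) for two aligned, equal-length strings."""
    for i, (r, t) in enumerate(zip(reference, trace)):
        if r != t and r in "ACGT" and t in "ACGT":
            yield i, r, t

def same_type_clusters(trace, reference, window=100, min_sites=5):
    """Report windows dominated by a single mismatch type, e.g. G-to-A.

    Overlapping windows may report the same cluster more than once;
    that is acceptable for a sketch.
    """
    sites = list(mismatches(trace, reference))  # sorted by position
    clusters = []
    for j, (pos, _, _) in enumerate(sites):
        group = [s for s in sites[j:] if s[0] - pos < window]
        kinds = Counter((r, t) for _, r, t in group)
        (ref_base, trace_base), count = kinds.most_common(1)[0]
        if count >= min_sites:
            clusters.append((pos, f"{ref_base}-to-{trace_base}", count))
    return clusters
```

On real traces this would run over base-caller output with per-base quality scores; the abstract's finding is precisely that such clusters are dominated by a systematic G-to-A error until very stringent accuracy thresholds are applied.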
Accurate Whole-Genome Sequencing and Haplotyping from 10 to 20 Human Cells
Recent advances in whole-genome sequencing have brought the vision of personal genomics and genomic medicine closer to reality. However, current methods lack clinical accuracy and cannot cost-effectively describe the context (haplotypes) in which genome variants co-occur. Here we describe a low-cost DNA sequencing and haplotyping process, Long Fragment Read (LFR) technology, which is similar to sequencing long single DNA molecules without cloning or separation of metaphase chromosomes. In this study, ten LFR libraries were made using only ~100 pg of human DNA per sample. Up to 97% of the heterozygous single-nucleotide variants (SNVs) were assembled into long haplotype contigs. Removal of false-positive SNVs not phased by multiple LFR haplotypes resulted in a final genome error rate of 1 in 10 Mb. Cost-effective and accurate genome sequencing and haplotyping from 10-20 human cells, as demonstrated here, will enable comprehensive genetic studies and diverse clinical applications.
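The false-positive filter mentioned above, discarding SNVs not phased by multiple LFR haplotypes, reduces to a simple support count. The sketch below is a hedged illustration under assumed data structures, not Complete Genomics' implementation.

```python
from collections import defaultdict

def multiply_phased_snvs(phased_calls, min_libraries=2):
    """Keep only SNVs independently phased by at least `min_libraries` libraries.

    phased_calls: iterable of (snv, library_id) pairs, one pair per LFR
    library that placed the SNV on a haplotype contig. The pair-based
    representation is an assumption for illustration.
    """
    support = defaultdict(set)
    for snv, lib in phased_calls:
        support[snv].add(lib)
    return {snv for snv, libs in support.items() if len(libs) >= min_libraries}
```

For scale, the reported rate of 1 error in 10 Mb corresponds to roughly 320 residual errors across a ~3.2 Gb human genome.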
Identification of Widespread Ultra-Edited Human RNAs
Adenosine-to-inosine modification of RNA molecules (A-to-I RNA editing) is an important mechanism that increases transcriptome diversity. It occurs when a genomically encoded adenosine (A) is converted to an inosine (I) by ADAR proteins. Sequencing reactions read inosine as guanosine (G); therefore, current methods to detect A-to-I editing sites align RNA sequences to their corresponding DNA regions and identify A-to-G mismatches. However, such methods perform poorly on RNAs that underwent extensive editing (“ultra”-editing), as the large number of mismatches obscures the genomic origin of these RNAs. Therefore, only a few anecdotal ultra-edited RNAs have been discovered so far. Here we introduce and apply a novel computational method to identify ultra-edited RNAs. We detected 760 ESTs containing 15,646 editing sites (more than 20 sites per EST, on average), of which 13,668 are novel. Ultra-edited RNAs exhibit the known sequence motif of ADARs and tend to localize in sense-strand Alu elements. Compared to sites of mild editing, ultra-editing occurs primarily in Alu-rich regions, where potential base pairing with neighboring, inverted Alus creates particularly long double-stranded RNA structures. Ultra-editing sites are underrepresented in old Alu subfamilies, tend to be non-conserved, and avoid exons, suggesting that ultra-editing is usually deleterious. A possible biological function of ultra-editing could be mediated by non-canonical splicing and cleavage of the RNA near the editing sites.
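The detection problem described above, that heavy A-to-G mismatching obscures the genomic origin of an edited read, suggests a simple trick: collapse A and G into one letter before matching, then count candidate sites afterwards. The sketch below illustrates that idea in Python; the exact alignment strategy, the function names, and the 20-site threshold are assumptions based only on the abstract.

```python
def collapse_ag(seq):
    """Map A->G so A-to-G differences no longer break the match.
    Sequences are assumed uppercase."""
    return seq.replace("A", "G")

def editing_sites(est, genome):
    """Positions where the genome has A but the EST reads G (aligned strings)."""
    return [i for i, (g, e) in enumerate(zip(genome, est)) if g == "A" and e == "G"]

def is_ultra_edited(est, genome, min_sites=20):
    # If the EST truly derives from this locus, the collapsed sequences
    # should match; for simplicity this sketch requires exact identity
    # rather than running a real aligner on the collapsed alphabet.
    if collapse_ag(est) != collapse_ag(genome):
        return False
    return len(editing_sites(est, genome)) > min_sites
```

A real pipeline would align collapsed reads genome-wide and then verify the strand and the ADAR sequence motif at each candidate site; the sketch only captures the alphabet-collapsing step that makes ultra-edited reads mappable at all.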
Harvard Personal Genome Project: Lessons from Participatory Public Research
Background: Since its initiation in 2005, the Harvard Personal Genome Project has enrolled thousands of volunteers interested in publicly sharing their genome, health, and trait data. Because these data are highly identifiable, we use an 'open consent' framework that purposefully excludes promises about privacy and requires participants to demonstrate comprehension prior to enrollment. Discussion: Our model of non-anonymous, public genomes has led us to a highly participatory model of researcher-participant communication and interaction. The participants, who are highly committed volunteers, pursue and donate research-relevant datasets on their own initiative and are actively engaged in conversations with both our staff and other Personal Genome Project participants. We have quantitatively assessed these communications and donations, and we report our experiences with returning research-grade whole-genome data to participants. We also describe some of the community growth and discussion that has occurred around the project. Summary: We find that public, non-anonymous data are valuable and lead to a participatory research model, which we encourage others to consider. The implementation of this model is greatly facilitated by web-based tools and methods and by participant education. The project's results include long-term, proactive participant involvement and the growth of a community that benefits both researchers and participants. © 2014 Ball et al.; licensee BioMed Central Ltd.
The whole genome sequences and experimentally phased haplotypes of over 100 personal genomes
Background: Since the completion of the Human Genome Project in 2003, it is estimated that more than 200,000 individual whole human genomes have been sequenced, a stunning accomplishment in such a short period of time. However, most of these were sequenced without experimental haplotype data and are therefore missing an important aspect of genome biology. In addition, much of the genomic data is not available to the public and lacks phenotypic information. Findings: As part of the Personal Genome Project, blood samples from 184 participants were collected and processed using Complete Genomics’ Long Fragment Read (LFR) technology. Here, we present the experimental whole-genome haplotyping and sequencing of these samples to an average read coverage depth of 100X. This is approximately three-fold higher than the read coverage applied to most whole human genome assemblies and ensures the highest-quality results. Currently, 114 genomes from this dataset are freely available in the GigaDB repository and are associated with rich phenotypic data; the remaining 70 should be added in the near future as they are approved through the PGP data release process. For reproducibility analyses, 20 genomes were sequenced at least twice using independent LFR barcoded libraries. Seven genomes were also sequenced using Complete Genomics’ standard non-barcoded library process. In addition, we report 2.6 million high-quality, rare variants not previously identified in the Single Nucleotide Polymorphism Database (dbSNP) or the 1000 Genomes Project Phase 3 data. Conclusions: These genomes represent a unique source of haplotype and phenotype data for the scientific community and should help to expand our understanding of human genome evolution and function. Electronic supplementary material: The online version of this article (doi:10.1186/s13742-016-0148-z) contains supplementary material, which is available to authorized users.
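The novelty claim above boils down to a set-difference over variant keys: a call counts as novel if it appears in neither dbSNP nor the 1000 Genomes Phase 3 call set. A minimal sketch, assuming variants are already normalized to (chrom, pos, ref, alt) tuples; loading and normalizing real VCFs is elided.

```python
def novel_variants(calls, dbsnp_keys, kg3_keys):
    """Return calls absent from both reference databases.

    calls: iterable of (chrom, pos, ref, alt) tuples from the sequenced genomes.
    dbsnp_keys, kg3_keys: sets of the same tuples built from dbSNP and
    1000 Genomes Phase 3 (the set representation is an assumption).
    """
    known = dbsnp_keys | kg3_keys
    return [v for v in calls if v not in known]

# Example:
# dbsnp = {("chr1", 12345, "A", "G")}
# kg3 = set()
# novel_variants([("chr1", 12345, "A", "G"), ("chr2", 99, "C", "T")], dbsnp, kg3)
# -> [("chr2", 99, "C", "T")]
```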
A public resource facilitating clinical use of genomes
Rapid advances in DNA sequencing promise to enable new diagnostics and individualized therapies. Achieving personalized medicine, however, will require extensive research on highly reidentifiable, integrated datasets of genomic and health information. To assist with this, participants in the Personal Genome Project choose to forgo privacy via our institutional review board-approved "open consent" process. The contribution of public data and samples facilitates both scientific discovery and standardization of methods. We present our findings after enrollment of more than 1,800 participants, including whole-genome sequencing of 10 pilot participant genomes (the PGP-10). We introduce the Genome-Environment-Trait Evidence (GET-Evidence) system. This tool automatically processes genomes and prioritizes both published and novel variants for interpretation. In the process of reviewing the presumed-healthy PGP-10 genomes, we find numerous literature references implying serious disease. Although it is sometimes impossible to rule out a late-onset effect, stringent evidence requirements can address the high rate of incidental findings. To that end, we develop a peer production system for recording and organizing variant evaluations according to standard evidence guidelines, creating a public forum for reaching consensus on the interpretation of clinically relevant variants. Genome analysis becomes a two-step process: using a prioritized list to record variant evaluations, then automatically sorting reviewed variants using these annotations. Genome data, health and trait information, participant samples, and variant interpretations are all shared in the public domain; we invite others to review our results using our participant samples and contribute to our interpretations. We offer our public resource and methods to further personalized medical research.
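The two-step process described in the abstract, prioritize-and-review followed by automatic sorting of reviewed variants, can be caricatured in a few lines. The schema and ordering heuristics below are assumptions for illustration, not the actual GET-Evidence data model.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    in_literature: bool = False   # a published disease association exists (assumed field)
    severity: int = 0             # severity annotation recorded at review, 0-5 (assumed scale)
    reviewed: bool = False

def prioritize(variants):
    """Step 1: surface unreviewed variants, literature-flagged ones first."""
    pending = [v for v in variants if not v.reviewed]
    return sorted(pending, key=lambda v: v.in_literature, reverse=True)

def report(variants):
    """Step 2: automatically sort reviewed variants by their annotations."""
    return sorted((v for v in variants if v.reviewed),
                  key=lambda v: v.severity, reverse=True)
```

The design point the abstract makes is the separation of concerns: humans record evidence-graded evaluations once, and downstream genome reports are then assembled mechanically from those shared annotations.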