136 research outputs found

    CGHub: Kick-starting the Worldwide Genome Web

    The University of California, Santa Cruz (UCSC) is under contract with the National Cancer Institute (NCI) to construct and operate the Cancer Genomics Hub (CGHub), a nation-scale library and user portal for cancer genomics data. This contract covers growth of the library to 5 petabytes. The NCI programs that feed into the library currently produce about 20 terabytes of data each month. We discuss the receiver-driven file transfer mechanism Annai GeneTorrent (GT) for use with the library. Annai GT uses multiple TCP streams from multiple computers at the library site to parallelize genome downloads. We review our performance experience with the new transfer mechanism and explain additions to the transfer protocol that support the security required in handling patient cancer genomics data.

    Retrocopy contributions to the evolution of the human genome

    Background: Evolution via point mutations is a relatively slow process and is unlikely to completely explain the differences between primates and other mammals. By contrast, 45% of the human genome is composed of retroposed elements, many of which were inserted in the primate lineage. A subset of retroposed mRNAs (retrocopies) shows strong evidence of expression in primates, often yielding functional retrogenes. Results: To identify and analyze the relatively recently evolved retrogenes, we carried out BLASTZ alignments of all human mRNAs against the human genome and scored a set of features indicative of retroposition. Of over 12,000 putative retrocopy-derived genes that arose mainly in the primate lineage, 726 with strong evidence of transcript expression were examined in detail. These mRNA retroposition events fall into three categories: I) 34 retrocopies and antisense retrocopies that added potential protein coding space and UTRs to existing genes; II) 682 complete retrocopy duplications inserted into new loci; and III) an unexpected set of 13 retrocopies that contributed out-of-frame or antisense sequences in combination with other types of transposed elements (SINEs, LINEs, LTRs), and even unannotated sequence, to form potentially novel genes with no homologs outside primates. In addition to their presence in human, several of the gene candidates also had potentially viable ORFs in chimpanzee, orangutan, and rhesus macaque, underscoring their potential for function. Conclusion: mRNA-derived retrocopies provide raw material for the evolution of genes in a wide variety of ways, duplicating and amending the protein coding region of existing genes as well as generating the potential for new protein coding space, or non-protein coding RNAs, by unexpected contributions out of frame, in reverse orientation, or from previously non-protein coding sequence.

    The UCSC Proteome Browser

    The University of California Santa Cruz (UCSC) Proteome Browser provides a wealth of protein information presented in graphical images and with links to other protein-related Internet sites. The Proteome Browser is tightly integrated with the UCSC Genome Browser. For the first time, Genome Browser users have both the genome and proteome worlds at their fingertips simultaneously. The Proteome Browser displays tracks of protein and genomic sequences, exon structure, polarity, hydrophobicity, locations of cysteine and glycosylation potential, Superfamily domains, and amino acids that deviate from normal abundance. Histograms show the genome-wide distribution of protein properties, including isoelectric point, molecular weight, number of exons, InterPro domains and cysteine locations, together with the specific property values of the selected protein. The Proteome Browser also provides links to gene annotations in the Genome Browser, the Known Genes details page and the Gene Sorter; domain information from Superfamily, InterPro and Pfam; three-dimensional structures at the Protein Data Bank and ModBase; and pathway data at KEGG, BioCarta/CGAP and BioCyc. As of August 2004, the Proteome Browser is available for human, mouse and rat proteomes. The browser may be accessed from any Known Genes details page of the Genome Browser at http://genome.ucsc.edu. A user's guide is also available on this website.

    Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

    The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis. Peer reviewed. Article signed by more than 40 authors: Francisco J. Pardo-Palacios, Dingjie Wang, Fairlie Reese, Mark Diekhans, Sílvia Carbonell-Sala, Brian Williams, Jane E. Loveland, Maite De María, Matthew S. Adams, Gabriela Balderrama-Gutierrez, Amit K. Behera, Jose M. Gonzalez Martinez, Toby Hunt, Julien Lagarde, Cindy E. Liang, Haoran Li, Marcus Jerryd Meade, David A. Moraga Amador, Andrey D. Prjibelski, Inanc Birol, Hamed Bostan, Ashley M. Brooks, Muhammed Hasan Çelik, Ying Chen, Mei R. M. Du, Colette Felton, Jonathan Göke, Saber Hafezqorani, Ralf Herwig, Hideya Kawaji, Joseph Lee, Jian-Liang Li, Matthias Lienhard, Alla Mikheenko, Dennis Mulligan, Ka Ming Nip, Mihaela Pertea, Matthew E. Ritchie, Andre D. Sim, Alison D. Tang, Yuk Kei Wan, Changqing Wang, Brandon Y. Wong, Chen Yang, If Barnes, Andrew E. Berry, Salvador Capella-Gutierrez, Alyssa Cousineau, Namrita Dhillon, Jose M. Fernandez-Gonzalez, Luis Ferrández-Peral, Natàlia Garcia-Reyero, Stefan Götz, Carles Hernández-Ferrer, Liudmyla Kondratova, Tianyuan Liu, Alessandra Martinez-Martin, Carlos Menor, Jorge Mestre-Tomás, Jonathan M. Mudge, Nedka G. Panayotova, Alejandro Paniagua, Dmitry Repchevsky, Xingjie Ren, Eric Rouchka, Brandon Saint-John, Enrique Sapena, Leon Sheynkman, Melissa Laird Smith, Marie-Marthe Suner, Hazuki Takahashi, Ingrid A. Youngworth, Piero Carninci, Nancy D. Denslow, Roderic Guigó, Margaret E. Hunter, Rene Maehr, Yin Shen, Hagen U. Tilgner, Barbara J. Wold, Christopher Vollmers, Adam Frankish, Kin Fai Au, Gloria M. Sheynkman, Ali Mortazavi, Ana Conesa & Angela N. Brooks. Postprint (published version).

    Comparative Genomics Search for Losses of Long-Established Genes on the Human Lineage

    Taking advantage of the complete genome sequences of several mammals, we developed a novel method to detect losses of well-established genes in the human genome through syntenic mapping of gene structures between the human, mouse, and dog genomes. Unlike most previous genomic methods for pseudogene identification, this analysis is able to differentiate losses of well-established genes from pseudogenes formed shortly after segmental duplication or generated via retrotransposition. Therefore, it enables us to find genes that were inactivated long after their birth, which were likely to have evolved nonredundant biological functions before being inactivated. The method was used to look for gene losses along the human lineage during the approximately 75 million years (My) since the common ancestor of primates and rodents (the euarchontoglire crown group). We identified 26 losses of well-established genes in the human genome that were all lost at least 50 My after their birth. Many of them were previously characterized pseudogenes in the human genome, such as GULO and UOX. Our methodology is highly effective at identifying losses of single-copy genes of ancient origin, allowing us to find a few well-known pseudogenes in the human genome missed by previous high-throughput genome-wide studies. In addition to confirming previously known gene losses, we identified 16 previously uncharacterized human pseudogenes that are definitive losses of long-established genes. Among them is ACYL3, an ancient enzyme present in archaea, bacteria, and eukaryotes, but lost approximately 6 to 8 Mya in the ancestor of humans and chimps. Although losses of well-established genes do not equate to adaptive gene losses, they are a useful proxy when searching for such genetic changes. This is especially true for adaptive losses that occurred more than 250,000 years ago, since any genetic evidence of the selective sweep indicative of such an event has been erased.

    CHASM and SNVBox: toolkit for detecting biologically important single nucleotide mutations in cancer

    Summary: Thousands of cancer exomes are currently being sequenced, yielding millions of non-synonymous single nucleotide variants (SNVs) of possible relevance to disease etiology. Here, we provide a software toolkit to prioritize SNVs based on their predicted contribution to tumorigenesis. It includes a database of precomputed, predictive features covering all positions in the annotated human exome and can be used either stand-alone or as part of a larger variant discovery pipeline.

    AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature

    The diagnosis of Mendelian disorders requires labor-intensive literature research. Trained clinicians can spend hours looking for the right publication(s) supporting a single gene that best explains a patient’s disease. AMELIE (Automatic Mendelian Literature Evaluation) greatly accelerates this process. AMELIE parses all 29 million PubMed abstracts and downloads and further parses hundreds of thousands of full-text articles in search of information supporting the causality and associated phenotypes of most published genetic variants. AMELIE then prioritizes patient candidate variants for their likelihood of explaining any patient’s given set of phenotypes. Diagnosis of singleton patients (without relatives’ exomes) is the most time-consuming scenario, and AMELIE ranked the causative gene at the very top for 66% of 215 diagnosed singleton Mendelian patients from the Deciphering Developmental Disorders project. Evaluating only the top 11 AMELIE-scored genes of 127 (median) candidate genes per patient resulted in a rapid diagnosis in more than 90% of cases. AMELIE-based evaluation of all cases was 3 to 19 times more efficient than hand-curated database–based approaches. We replicated these results on a retrospective cohort of clinical cases from Stanford Children’s Health and the Manton Center for Orphan Disease Research. An analysis web portal with our most recent update, programmatic interface, and code is available at AMELIE.stanford.edu