261 research outputs found

    GENCODE: reference annotation for the human and mouse genomes in 2023.

    GENCODE produces high quality gene and transcript annotation for the human and mouse genomes. All GENCODE annotation is supported by experimental data and serves as a reference for genome biology and clinical genomics. The GENCODE consortium generates targeted experimental data, develops bioinformatic tools and carries out analyses that, along with externally produced data and methods, support the identification and annotation of transcript structures and the determination of their function. Here, we present an update on the annotation of human and mouse genes, including developments in the tools, data, analyses and major collaborations which underpin this progress. For example, we report the creation of a set of non-canonical ORFs identified in GENCODE transcripts, the LRGASP collaboration to assess the use of long transcriptomic data to build transcript models, the progress in collaborations with RefSeq and UniProt to increase convergence in the annotation of human and mouse protein-coding genes, the propagation of GENCODE across the human pan-genome and the development of new tools to support annotation of regulatory features by GENCODE. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org

    GENCODE reference annotation for the human and mouse genomes

    The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature, to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.

    MapSDI: A Scaled-up Semantic Data Integration Framework for Knowledge Graph Creation

    Semantic web technologies have contributed effective solutions to the problems of data integration and knowledge graph creation. However, with the rapid growth of big data in diverse domains, several interoperability issues remain to be addressed, scalability being one of the main challenges. In this paper, we address the problem of knowledge graph creation at scale and provide MapSDI, a mapping rule-based framework for optimizing semantic data integration into knowledge graphs. MapSDI allows for the efficient semantic enrichment of large, heterogeneous, and potentially low-quality data. The input of MapSDI is a set of data sources and mapping rules written in a mapping language such as RML. First, MapSDI pre-processes the sources based on semantic information extracted from the mapping rules by applying basic database operators: it projects the required attributes, eliminates duplicates, and selects relevant entries. All these operators are defined based on the knowledge encoded by the mapping rules, which are then used by the semantification engine (or RDFizer) to produce a knowledge graph. We have empirically studied the impact of MapSDI on existing RDFizers and observed that knowledge graph creation time can be reduced by one order of magnitude on average. It is also shown, theoretically, that the source and rule transformations provided by MapSDI are data-lossless.
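
    The three pre-processing operators named in the abstract above (projection, selection, duplicate elimination) can be illustrated with a minimal sketch. This is not MapSDI's implementation: the function name `preprocess`, the `keep_row` predicate and the sample rows are hypothetical, and the attribute list stands in for whatever the mapping rules would actually reference.

```python
def preprocess(rows, needed_attributes, keep_row=lambda row: True):
    """Project, select and de-duplicate rows before they reach an RDFizer.

    rows              -- list of dicts, one per source record (hypothetical input)
    needed_attributes -- attributes assumed to be referenced by the mapping rules
    keep_row          -- selection predicate assumed to be derived from the rules
    """
    seen = set()
    result = []
    for row in rows:
        # Selection: drop records the mapping rules would never use.
        if not keep_row(row):
            continue
        # Projection: keep only the attributes the rules refer to.
        projected = tuple((attr, row.get(attr)) for attr in needed_attributes)
        # Duplicate elimination: after projection, identical records collapse.
        if projected in seen:
            continue
        seen.add(projected)
        result.append(dict(projected))
    return result


# Illustrative data only; values must be hashable for the duplicate check.
rows = [
    {"gene": "TUBB3", "chrom": "16", "note": "a"},
    {"gene": "TUBB3", "chrom": "16", "note": "b"},
    {"gene": "", "chrom": "X", "note": "c"},
    {"gene": "AR", "chrom": "X", "note": "d"},
]
reduced = preprocess(rows, ["gene", "chrom"], keep_row=lambda r: bool(r["gene"]))
print(reduced)
```

    In this toy case four source records shrink to two before any RDF generation takes place, which is the kind of reduction the abstract credits for the shorter creation time.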

    ncRNA BC1 influences translation in the oocyte

    Regulation of translation is essential for the diverse biological processes involved in development. In particular, mammalian oocyte development requires the precisely controlled translation of maternal transcripts to coordinate meiotic and early embryo progression while transcription is silent. It has recently been reported that short and long noncoding RNAs (ncRNAs) are key components of mRNA translation control. We found that the ncRNA Brain cytoplasmic 1 (BC1) has a role in the fully grown germinal vesicle (GV) mouse oocyte, where it is highly expressed in the cytoplasm in association with polysomes. Overexpression of BC1 in the GV oocyte leads to a minute decrease in global translation, with a significant reduction in the translation of specific mRNAs via interaction with the Fragile X Mental Retardation Protein (FMRP). BC1 performs a repressive role in translation only in the GV stage oocyte, without forming FMRP or Poly(A) granules. In conclusion, BC1 acts as a translational repressor of specific mRNAs in the GV stage via its binding to a subset of mRNAs and physical interaction with FMRP. The results reported herein contribute to the understanding of the molecular mechanisms of developmental events connected with maternal mRNA translation.

    Genome-wide variant calling in reanalysis of exome sequencing data uncovered a pathogenic TUBB3 variant.

    Almost half of all individuals affected by intellectual disability (ID) remain undiagnosed. In the Solve-RD project, exome sequencing (ES) datasets from unresolved individuals with (syndromic) ID (n = 1,472 probands) are systematically reanalyzed, starting from raw sequencing files, followed by genome-wide variant calling and new data interpretation. This strategy led to the identification of a disease-causing de novo missense variant in TUBB3 in a girl with severe developmental delay, secondary microcephaly, brain imaging abnormalities, high hypermetropia, strabismus and short stature. Interestingly, the TUBB3 variant could only be identified through reanalysis of ES data using a genome-wide variant calling approach, despite being located in protein coding sequence. More detailed analysis revealed that the position of the variant within exon 5 of TUBB3 was not targeted by the enrichment kit, although consistent high-quality coverage was obtained at this position, resulting from nearby targets that provide off-target coverage. In the initial analysis, variant calling was restricted to the exon targets ± 200 bases, allowing the variant to escape detection by the variant calling algorithm. This phenomenon may potentially occur more often, as we determined that 36 established ID genes have robust off-target coverage in coding sequence. Moreover, within these regions, for 17 genes (likely) pathogenic variants have been identified before. Therefore, this clinical report highlights that, although compute-intensive, performing genome-wide variant calling instead of target-based calling may lead to the detection of diagnostically relevant variants that would otherwise remain unnoticed
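
    The effect described above, where a well-covered variant is missed because calling is limited to targets ± 200 bases, can be illustrated with a small sketch. The coordinates, the function name `in_padded_targets` and the single-chromosome simplification are all hypothetical; this is not the Solve-RD pipeline.

```python
def in_padded_targets(position, targets, padding=200):
    """Return True if `position` lies within any target interval +/- padding.

    targets -- list of (start, end) tuples on one chromosome, inclusive
               (hypothetical coordinates, not taken from any enrichment kit)
    """
    return any(start - padding <= position <= end + padding
               for start, end in targets)


# Illustrative intervals only.
targets = [(1000, 1200), (5000, 5300)]

near_variant = 1350      # 150 bases past a target end: still inside the padding
distant_variant = 1500   # 300 bases past a target end: outside the padding

print(in_padded_targets(near_variant, targets))
print(in_padded_targets(distant_variant, targets))
```

    A target-based caller would never evaluate the second position, however good its off-target coverage, whereas a genome-wide caller applies no such filter.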

    A Compendium of AR Splice Variants in Metastatic Castration-Resistant Prostate Cancer

    Treatment-induced AR alterations, including AR alternative splice variants (AR-Vs), have been extensively linked to primary and acquired resistance to conventional and next-generation hormonal therapies in prostate cancer and have therefore gained attention. Our aim was to uniformly determine recurrent AR-Vs in metastatic castration-resistant prostate cancer (mCRPC) using whole transcriptome sequencing in order to assess which AR-Vs might hold potential diagnostic or prognostic relevance in future research. This study reports that, in addition to the promising biomarker AR-V7, AR45 and AR-V3 were also seen as recurrent AR-Vs and that the presence of any AR-V could be associated with higher AR expression. With future research, these AR-Vs may therefore harbor similar or complementary roles to AR-V7 as predictive and prognostic biomarkers in mCRPC or as proxies for abundant AR expression.

    Diagnostic accuracy of liquid biopsy in endometrial cancer

    Background: Liquid biopsy is a minimally invasive collection of a patient body fluid sample. In oncology, liquid biopsies offer several advantages compared to traditional tissue biopsies. However, the potential of this method in endometrial cancer (EC) remains poorly explored. We studied the utility of tumor educated platelets (TEPs) and circulating tumor DNA (ctDNA) for preoperative EC diagnosis, including histology determination. Methods: TEPs from 295 subjects (53 EC patients, 38 patients with benign gynecologic conditions, and 204 healthy women) were RNA-sequenced. DNA sequencing data were obtained for 519 primary tumor tissues and 16 plasma samples. Artificial intelligence was applied to sample classification. Results: The platelet-dedicated classifier yielded an AUC of 97.5% in the test set when discriminating between healthy subjects and cancer patients. However, the discrimination between endometrial cancer and benign gynecologic conditions was more challenging, with an AUC of 84.1%. The ctDNA-dedicated classifier discriminated primary tumor tissue samples with an AUC of 96% and ctDNA blood samples with an AUC of 69.8%. Conclusions: Liquid biopsies show potential in EC diagnosis. Both TEPs and ctDNA profiles coupled with artificial intelligence constitute a source of useful information. Further work involving more cases is warranted.

    Ranked Choice Voting for Representative Transcripts with TRaCE

    Genome sequencing projects annotate protein-coding gene models with multiple transcripts, aiming to represent all of the available transcript evidence. However, downstream analyses often operate on only one representative transcript per gene locus, sometimes known as the canonical transcript. To choose canonical transcripts, TRaCE (Transcript Ranking and Canonical Election) holds an ‘election’ in which a set of RNA-seq samples rank transcripts by annotation edit distance. These sample-specific votes are tallied along with other criteria such as protein length and InterPro domain coverage. The winner is selected as the canonical transcript, but the election proceeds through multiple rounds of voting to order all the transcripts by relevance. Based on the set of expression data provided, TRaCE can identify the most common isoforms from a broad expression atlas or prioritize alternative transcripts expressed in specific contexts
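
    The multi-round election described above can be sketched in simplified form. This is not TRaCE's algorithm: here each sample submits a best-first ranking, each round elects the remaining transcript with the most top-choice votes, and ties are broken by protein length, which is an assumption standing in for the additional criteria the abstract mentions. The transcript IDs, ballots and lengths are hypothetical.

```python
from collections import Counter


def rank_transcripts(ballots, protein_length):
    """Order transcripts by repeated rounds of top-choice voting.

    ballots        -- one best-first list of transcript IDs per RNA-seq sample
    protein_length -- dict of transcript ID -> protein length (assumed tie-breaker)
    """
    remaining = set(protein_length)
    order = []
    while remaining:
        tally = Counter()
        for ballot in ballots:
            # Each sample votes for its highest-ranked transcript still in the race.
            for transcript in ballot:
                if transcript in remaining:
                    tally[transcript] += 1
                    break
        # Most votes wins; ties fall back to protein length, then to the ID
        # so that the result is deterministic.
        winner = max(remaining, key=lambda t: (tally[t], protein_length[t], t))
        order.append(winner)
        remaining.remove(winner)
    return order


# Illustrative data only.
ballots = [
    ["T1", "T2", "T3"],
    ["T1", "T3", "T2"],
    ["T2", "T1", "T3"],
]
protein_length = {"T1": 300, "T2": 450, "T3": 450}

ranking = rank_transcripts(ballots, protein_length)
print(ranking)
```

    The first element of the returned list plays the role of the canonical transcript, and the rest of the list gives the relevance ordering that later rounds produce.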