
    BioMuta and BioXpress: mutation and expression knowledgebases for cancer biomarker discovery

    Single-nucleotide variation and gene expression of disease samples represent important resources for biomarker discovery. Many databases have been built to host such data and make them available to the community, but these databases are frequently limited in scope and/or content. BioMuta, a database of cancer-associated single-nucleotide variations, and BioXpress, a database of cancer-associated differentially expressed genes and microRNAs, differ from other disease-associated variation and expression databases primarily in that they aggregate data across many studies into a single source with a unified representation and annotation of functional attributes. Early versions of these resources were initiated by pilot funding for specific research applications, but newly awarded funds have enabled hardening of these databases to production-level quality and will allow for sustained development of these resources for the next few years. Because both resources were developed using a similar methodology of integration, curation, unification, and annotation, we present BioMuta and BioXpress as allied databases that will facilitate a more comprehensive view of gene associations in cancer. BioMuta and BioXpress are hosted on the High-performance Integrated Virtual Environment (HIVE) server at the George Washington University at https://hive.biochemistry.gwu.edu/biomuta and https://hive.biochemistry.gwu.edu/bioxpress, respectively.

    Whole Genome Variant Dataset for Enriching Studies across 18 Different Cancers

    Whole genome sequencing (WGS) has helped to revolutionize biology, but extracting valuable inferences from this information remains a computational challenge. Here, we present the cancer-associated variants from The Cancer Genome Atlas (TCGA) WGS dataset. This dataset will allow cancer researchers to expand their analysis beyond the exomic regions to the entire genome. A total of 1342 WGS alignments available from the consortium were processed with VarScan2 and deposited to the NCI Cancer Cloud. The sample set covers 18 different cancers and reveals 157,313,519 pooled (non-unique) cancer-associated single-nucleotide variations (SNVs) across all samples. There was an average of 117,223 SNVs per sample, with a range from 1111 to 775,470 and a standard deviation of 163,273. The dataset was incorporated into BigQuery, which provides fast access and cross-mapping and will allow researchers to enrich their current studies with newly available genomic data.
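The per-sample average quoted in this abstract can be checked from the pooled total and the sample count. The sketch below is only an arithmetic check on the reported figures; the variable names are illustrative and not taken from the paper.

```python
# Figures as reported in the abstract.
total_snvs = 157_313_519   # pooled (non-unique) SNVs across all samples
n_samples = 1342           # WGS alignments processed
reported_std = 163_273     # reported standard deviation of SNVs per sample

# Mean SNVs per sample.
mean_snvs = total_snvs / n_samples
print(round(mean_snvs))    # 117223, matching the reported average

# Coefficient of variation: standard deviation relative to the mean.
cv = reported_std / mean_snvs
print(cv > 1)              # True: the spread exceeds the mean
```

A standard deviation larger than the mean, together with the reported range of 1111 to 775,470, is consistent with a strongly right-skewed distribution of SNV counts across samples.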

    An Exploration of Cancer-Associated Non-Coding Variations in Whole Genome Sequencing Data

    Genomics has benefited from an explosion in affordable high-throughput technology for whole genome sequencing. The regulatory and functional aspects of non-coding regions may be an important contributor to oncogenesis. The majority of cancer-associated mutations in 154 whole genome sequences, covering lung adenocarcinoma (LUAD), breast invasive carcinoma (BRCA), kidney renal papillary cell carcinoma (KIRP), uterine corpus endometrial carcinoma (UCEC), and colon adenocarcinoma (COAD) and two races, are found outside the coding region (4,432,885 in non-coding versus 1,412,731 in coding regions). A pan-cancer analysis found significantly mutated windows (292 to 3,881 in count), demonstrating that there are significant numbers of large mutated regions in the non-coding genome. Fifty-nine significantly mutated windows were found in all studied races and cancers, including many in centromeric locations. The X chromosome had the largest set of universal windows, which cluster almost exclusively in Xq11, an area linked to chromosomal instability and oncogenesis. The presence of 19 to 114 large consecutive clusters (super windows) provides further evidence that large mutated regions in the genome are influencing cancer development. We investigated the frequency of these single-nucleotide variations in 12 different tissue-independent DNA functional elements. We demonstrated that the overlap of cancer-linked variations with these DNA functional elements is not likely the result of random selection, and most functional elements had significantly more single-nucleotide variations than expected. We identified highly variant functional elements in 5 cancer types, primarily in long non-coding RNAs and transcription factor binding sites, suggesting that some functional elements might have wide-ranging effects on oncogenesis.
Finally, we demonstrated that the ratios of SNVs within DNA functional elements show a level of distinction, suggesting that different cancer types can be fingerprinted via these ratios. We combined multinomial logistic regression with one-hot encoding, a cross-entropy loss function, and a custom stochastic gradient descent function to build several prediction models from the data generated. Three models were trained, using respectively a binary representation of variation in the 59 universal windows, variation counts in the 59 universal windows, and ratios of variations found in DNA functional elements. These models performed at 53.3%, 33.3%, and 40.0% accuracy on the test set, respectively. Counterintuitively, the model with the lowest performance (variation counts in the universal windows) showed the most promise for improvement through increased data.
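The modelling approach named in this abstract (multinomial logistic regression, one-hot encoded labels, cross-entropy loss, stochastic gradient descent) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, learning rate, epoch count, and the toy data (a made-up binary presence/absence matrix over four windows and three classes) are all assumptions chosen for illustration.

```python
import math
import random

def one_hot(label, n_classes):
    """Encode an integer class label as a one-hot vector."""
    return [1.0 if k == label else 0.0 for k in range(n_classes)]

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, bias, x):
    """Class probabilities for one feature vector x."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)

def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(max(p, 1e-12)) for p, t in zip(probs, target))

def train_sgd(X, y, n_classes, lr=0.1, epochs=200, seed=0):
    """Fit weights by per-sample stochastic gradient descent."""
    rng = random.Random(seed)
    n_features = len(X[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    bias = [0.0] * n_classes
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            probs = predict_proba(weights, bias, X[i])
            target = one_hot(y[i], n_classes)
            for k in range(n_classes):
                # Gradient of cross-entropy w.r.t. the class-k score.
                grad = probs[k] - target[k]
                bias[k] -= lr * grad
                for j in range(n_features):
                    weights[k][j] -= lr * grad * X[i][j]
    return weights, bias

def predict(weights, bias, x):
    """Index of the most probable class."""
    probs = predict_proba(weights, bias, x)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical toy data: rows are samples, columns are windows (1 = mutated).
X = [[1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0],
     [0, 1, 1, 0], [0, 0, 0, 1], [0, 1, 0, 1]]
y = [0, 0, 1, 1, 2, 2]  # hypothetical cancer-type labels

weights, bias = train_sgd(X, y, n_classes=3)
accuracy = sum(predict(weights, bias, x) == t for x, t in zip(X, y)) / len(X)
print(f"training accuracy: {accuracy:.3f}")
```

The toy classes are linearly separable, so this sketch fits its training data easily; the accuracies reported in the abstract were measured on a held-out test set of real samples and are not comparable.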

    Investigation of somatic single nucleotide variations in human endogenous retrovirus elements and their potential association with cancer.

    Human endogenous retroviruses (HERVs) have been investigated for potential links with human cancer. However, the distribution of somatic nucleotide variations in HERV elements has not been explored in detail. This study aims to identify HERV elements with an over-representation of somatic mutations (hot spots) in cancer patients. Four HERV elements with mutation hotspots were identified that overlap with exons of four human protein-coding genes. These hotspots were identified based on the significant over-representation (p<8.62e-4) of non-synonymous single-nucleotide variations (nsSNVs). These genes are TNN (HERV-9/LTR12), OR4K15 (HERV-IP10F/LTR10F), ZNF99 (HERV-W/HERV17/LTR17), and KIR2DL1 (MST/MaLR). In an effort to identify mutations that affect survival, all nsSNVs were further evaluated, and it was found that kidney cancer patients with mutation C2270G in ZNF99 have a significantly lower survival rate (hazard ratio = 2.6) compared to those without it. Among HERV elements in the human non-protein-coding regions, we found 788 HERVs with significantly elevated numbers of somatic single-nucleotide variations (SNVs) (p<1.60e-5). In this category, the top three HERV elements with significantly over-represented SNVs are HERV-H/LTR7, HERV-9/LTR12 and HERV-L/MLT2. The majority of the SNVs in these 788 HERV elements are located in three DNA functional groups: long non-coding RNAs (lncRNAs) (60%), introns (22.2%) and transcription factor binding sites (TFBS) (14.8%). This study provides a list of mutational hotspots in HERVs, which could potentially be used as biomarkers and therapeutic targets.

    Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets

    To explore complex biological questions, it is often necessary to access various data types from public data repositories. As the volume and complexity of biological sequence data grow, public repositories face significant challenges in ensuring that the data is easily discoverable and usable by the biological research community. To address these challenges, the National Center for Biotechnology Information (NCBI) has created NCBI Datasets. This resource provides straightforward, comprehensive, and scalable access to biological sequences, annotations, and metadata for a wide range of taxa. Following the FAIR (Findable, Accessible, Interoperable, and Reusable) data management principles, NCBI Datasets offers user-friendly web interfaces, command-line tools, and documented APIs, empowering researchers to access NCBI data seamlessly. The data is delivered as packages of sequences and metadata, thus facilitating improved data retrieval, sharing, and usability in research. Moreover, this data delivery method fosters effective data attribution and promotes its further reuse. This paper outlines the current scope of data accessible through NCBI Datasets and explains various options for exploring and downloading the data.