31 research outputs found

    A community-powered search of machine learning strategy space to find NMR property prediction models

    The rise of machine learning (ML) has created an explosion in the potential strategies for using data to make scientific predictions. For physical scientists wishing to apply ML strategies to a particular domain, it can be difficult to assess in advance what strategy to adopt within a vast space of possibilities. Here we outline the results of an online community-powered effort to swarm search the space of ML strategies and develop algorithms for predicting atomic-pairwise nuclear magnetic resonance (NMR) properties in molecules. Using an open-source dataset, we worked with Kaggle to design and host a 3-month competition which received 47,800 ML model predictions from 2,700 teams in 84 countries. Within 3 weeks, the Kaggle community produced models with comparable accuracy to our best previously published "in-house" efforts. A meta-ensemble model constructed as a linear combination of the top predictions has a prediction accuracy which exceeds that of any individual model, 7-19x better than our previous state-of-the-art. The results highlight the potential of transformer architectures for predicting quantum mechanical (QM) molecular properties.
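    The meta-ensemble idea in the abstract above (a linear combination of several models' predictions) can be illustrated with a minimal sketch. The abstract does not say how the combination weights were chosen; ordinary least squares is assumed here purely for illustration, and the three "models", their noise levels, and the function names are hypothetical.

```python
import numpy as np

def fit_linear_ensemble(preds, target):
    """Least-squares weights for combining model predictions.

    preds has shape (n_samples, n_models); target has shape (n_samples,).
    Least squares is an assumption of this sketch, not the paper's stated method.
    """
    weights, *_ = np.linalg.lstsq(preds, target, rcond=None)
    return weights

def ensemble_predict(preds, weights):
    """Weighted sum of the individual model predictions."""
    return preds @ weights

def mse(pred, target):
    """Mean squared error."""
    return float(np.mean((pred - target) ** 2))

# Synthetic data: three hypothetical models, each the target plus independent noise.
rng = np.random.default_rng(0)
target = rng.normal(size=200)
preds = np.stack(
    [target + rng.normal(scale=s, size=200) for s in (0.3, 0.4, 0.5)],
    axis=1,
)

weights = fit_linear_ensemble(preds, target)
blend = ensemble_predict(preds, weights)
individual_errors = [mse(preds[:, j], target) for j in range(preds.shape[1])]
blend_error = mse(blend, target)
```

    On the data used to fit the weights, the blended error cannot exceed that of any single model, because each single model is itself one possible weighting. Whether the gain holds on unseen data depends on how the weights are validated.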

    Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples

    Funder: NCI U24CA211006
    Abstract: The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) curated consensus somatic mutation calls using whole exome sequencing (WES) and whole genome sequencing (WGS), respectively. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2,658 cancers across 38 tumour types, we compare WES and WGS side-by-side from 746 TCGA samples, finding that ~80% of mutations overlap in covered exonic regions. We estimate that low variant allele fraction (VAF < 15%) and clonal heterogeneity contribute up to 68% of private WGS mutations and 71% of private WES mutations. We observe that ~30% of private WGS mutations trace to mutations identified by a single variant caller in WES consensus efforts. WGS captures both ~50% more variation in exonic regions and un-observed mutations in loci with variable GC-content. Together, our analysis highlights technological divergences between two reproducible somatic variant detection efforts.

    Improved Scaffold Hopping in Ligand-based Virtual Screening Using Neural Representation Learning

    Deep learning has demonstrated significant potential in advancing the state of the art in many problem domains, especially those benefiting from automated feature extraction. Yet the methodology has seen limited adoption in the field of ligand-based virtual screening (LBVS), as traditional approaches typically require large, target-specific training sets, which limits their value in most prospective applications. Here, we report the development of a neural network architecture and a learning framework designed to yield a generally applicable tool for LBVS. Our approach uses the molecular graph as input, and involves learning a representation that places compounds of similar biological profiles in close proximity within a hyperdimensional feature space; this is achieved by simultaneously leveraging historical screening data against a multitude of targets during training. Cosine distance between molecules in this space becomes a general similarity metric, and can readily be used to rank order database compounds in LBVS workflows. We demonstrate that the resulting model generalizes exceptionally well to compounds and targets not used in its training. In three commonly employed LBVS benchmarks, our method outperforms popular fingerprinting algorithms without the need for any target-specific training. Moreover, we show that the learned representation yields superior performance in scaffold hopping tasks, and is largely orthogonal to existing fingerprints. In summary, we have developed and validated a framework for learning a molecular representation that is applicable to LBVS in a target-agnostic fashion, with as few as one query compound. Our approach can also enable organizations to generate additional value from large screening data repositories, and to this end we are making its implementation freely available at https://github.com/totient-bio/gatnn-vs
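    The ranking step described in the abstract above (ordering database compounds by cosine distance to a single query in a learned feature space) can be sketched as follows. This is not the gatnn-vs implementation: the three-dimensional vectors are made-up stand-ins for learned embeddings, and the function names are hypothetical.

```python
import numpy as np

def cosine_distance(query, library):
    """Cosine distance (1 - cosine similarity) from one query vector to each library row."""
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    return 1.0 - lib @ q

def rank_by_similarity(query, library):
    """Indices of library rows ordered from most to least similar to the query."""
    return np.argsort(cosine_distance(query, library), kind="stable")

# Toy embeddings; in practice these would come from the trained network.
query = np.array([1.0, 0.0, 0.0])
library = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the query: distance 1
    [2.0, 0.0, 0.0],   # same direction as the query: distance 0
    [1.0, 1.0, 0.0],   # 45 degrees from the query: distance about 0.29
    [-1.0, 0.0, 0.0],  # opposite direction: distance 2
])

order = rank_by_similarity(query, library)
```

    Because cosine distance ignores vector length, the second row ranks first even though its magnitude differs from the query's; only direction in the feature space matters.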

    ga4gh/task-execution-schemas: v0.3

    TES Issue backlog
    Minor (renames and cleanup):
      - "contents" name (https://github.com/ga4gh/task-execution-schemas/pull/71)
      - "image_name", "cmd", "environ" renames (https://github.com/ga4gh/task-execution-schemas/pull/83)
      - "Resources.size_gb" rename (https://github.com/ga4gh/task-execution-schemas/pull/85)
      - "TaskParameter" rename (https://github.com/ga4gh/task-execution-schemas/pull/86)
      - "ERROR" state rename (https://github.com/ga4gh/task-execution-schemas/pull/88)
      - Clarify stdout/err docs (https://github.com/ga4gh/task-execution-schemas/pull/94)
      - Clarify volumes (https://github.com/ga4gh/task-execution-schemas/pull/95)
    Deletions:
      - Remove "project" (https://github.com/ga4gh/task-execution-schemas/pull/91)
      - Remove "PAUSED" state (https://github.com/ga4gh/task-execution-schemas/pull/89)
      - Remove ports (https://github.com/ga4gh/task-execution-schemas/pull/96)
    Additions:
      - System logs field (https://github.com/ga4gh/task-execution-schemas/pull/80)
      - Field for task creation time (creation_time) ( https://github.com/ga4gh/task-execution-schemas/pull/90

    Sex differences in oncogenic mutational processes

    Sex differences have been observed in multiple facets of cancer epidemiology, treatment and biology, and in most cancers outside the sex organs. Efforts to link these clinical differences to specific molecular features have focused on somatic mutations within the coding regions of the genome. Here we report a pan-cancer analysis of sex differences in whole genomes of 1983 tumours of 28 subtypes as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium. We both confirm the results of exome studies, and also uncover previously undescribed sex differences. These include sex-biases in coding and non-coding cancer drivers, mutation prevalence and, strikingly, in mutational signatures related to underlying mutational processes. These results underline the pervasiveness of molecular sex differences and strengthen the call for increased consideration of sex in molecular cancer research.
    Peer reviewed