8,622 research outputs found

    Integrated Machine Learning and Bioinformatics Approaches for Prediction of Cancer-Driving Gene Mutations

    Cancer arises from the accumulation of somatic mutations and genetic alterations in cell division checkpoints and apoptosis, which often leads to abnormal tumor proliferation. Proper classification of cancer-linked driver mutations will considerably help our understanding of the molecular dynamics of cancer. In this study, we compared several cancer-specific predictive models for prediction of driver mutations in cancer-linked genes; the models were validated on canonical data sets of functionally validated mutations and applied to raw cancer genomics data. By analyzing pathogenicity prediction and conservation scores, we have shown that evolutionary conservation scores play a pivotal role in the classification of cancer drivers and were the most informative features in the driver mutation classification. Through extensive comparative analysis with structure-functional experiments and multicenter mutational calling data from PanCancer Atlas studies, we have demonstrated the robustness of our models and addressed the validity of computational predictions. We evaluated the performance of our models using standard diagnostic metrics such as sensitivity, specificity, area under the curve, and F-measure. To address the interpretability of cancer-specific classification models and obtain novel insights about molecular signatures of driver mutations, we have complemented machine learning predictions with structure-functional analysis of cancer driver mutations in several key tumor suppressor genes and oncogenes. Through the experiments carried out in this study, we found that evolutionary-based features have the strongest signal in the machine learning classification of driver mutations and provide orthogonal information to the ensemble-based scores that are prominent in the ranking of feature importance.
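
The diagnostic metrics named in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names, the label encoding (1 = driver, 0 = passenger), and the toy data are assumptions made for illustration only. The area under the ROC curve is computed with the rank-based (Mann-Whitney) formulation, in which tied scores count as half a win.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (assumed: 1 = driver, 0 = passenger)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def diagnostic_metrics(y_true, y_pred):
    """Return sensitivity, specificity, precision and F-measure (F1)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def auc(y_true, scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs in which the positive example has the higher score."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels and predictions, invented for this example.
    labels = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 1, 0, 0, 0, 1, 0, 1]
    print(diagnostic_metrics(labels, predicted))
```

Sensitivity and specificity are reported separately because driver mutations are typically far rarer than passengers, so overall accuracy alone can hide poor recall of drivers.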

    Cancer risk prediction with whole exome sequencing and machine learning

    Accurate cancer risk and survival time prediction are important problems in personalized medicine, where disease diagnosis and prognosis are tuned to individuals based on their genetic material. Cancer risk prediction informs decisions about regular screening, which helps detect disease at an early stage and therefore increases the probability of successful treatment. Cancer risk prediction is a challenging problem. Lifestyle, environment, family history, and genetic predisposition are some of the factors that influence disease onset. Cancer risk prediction based on predisposing genetic variants has been studied extensively. Most studies have examined the predictive ability of variants in known mutated genes for specific cancers. However, previous studies have not explored the predictive ability of collective genomic variants from whole-exome sequencing data. It is crucial to train a model on one study and test it on a related, independent study to ensure that the predictive model generalizes to other datasets. Survival time prediction allows patients and physicians to evaluate treatment feasibility and helps chart health treatment plans. Many studies have concluded that clinicians are inaccurate and often optimistic in predicting patients’ survival time; therefore, there is a growing need for automated survival time prediction from genomic and medical imaging data. For cancer risk prediction, this dissertation explores how ranking genomic variants in whole-exome sequencing data with univariate feature selection methods affects the predictive capability of machine learning classifiers. The dissertation performs cross-study analyses in chronic lymphocytic leukemia, glioma, and kidney cancers, which show that the top-ranked variants achieve better accuracy than the full set of genomic variants.
For survival time prediction, many studies have devised 3D convolutional neural networks (CNNs) to improve the accuracy of classifying glioma patients into survival categories from structural magnetic resonance imaging (MRI) volumes. This dissertation proposes a new multi-path convolutional neural network with SNP and demographic features to predict glioblastoma survival groups with a one-year threshold that improves upon existing machine learning methods. The dissertation also proposes a multi-path neural network system to predict glioblastoma survival categories with a 14-year threshold from a heterogeneous combination of genomic variations, messenger ribonucleic acid (mRNA) expression, 3D post-contrast T1 MRI volumes, and 2D post-contrast T1 MRI modality scans that show the malignancy. In 10-fold cross-validation, the mean accuracy of the proposed network with handpicked 2D MRI slices (that manifest the tumor), mRNA expression, and SNPs slightly improves upon that of each data source individually.
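
The idea of ranking variants with a univariate feature selection method can be sketched as follows. The abstract does not say which univariate statistic the dissertation uses, so this example assumes a chi-square test on a 2x2 carrier-by-phenotype table; the function names, the 0/1 encoding of genotypes, and the toy data are likewise hypothetical.

```python
def chi2_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table:
    a = carriers with cancer,     b = carriers without cancer,
    c = non-carriers with cancer, d = non-carriers without cancer."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A zero margin (e.g. everyone carries the variant) carries no signal.
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def rank_variants(genotypes, labels, top_k):
    """Return indices of the top_k variants by chi-square score.

    genotypes: list of samples, each a list of 0/1 carrier flags (assumed encoding)
    labels:    list of 0/1 phenotype labels, one per sample
    """
    n_variants = len(genotypes[0])
    scored = []
    for j in range(n_variants):
        a = sum(1 for row, y in zip(genotypes, labels) if row[j] == 1 and y == 1)
        b = sum(1 for row, y in zip(genotypes, labels) if row[j] == 1 and y == 0)
        c = sum(1 for row, y in zip(genotypes, labels) if row[j] == 0 and y == 1)
        d = sum(1 for row, y in zip(genotypes, labels) if row[j] == 0 and y == 0)
        scored.append((chi2_2x2(a, b, c, d), j))
    # Highest score first; ties are broken by variant index for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scored[:top_k]]


if __name__ == "__main__":
    # Toy training study: 6 samples x 3 variants, invented for this example.
    train_genotypes = [
        [1, 1, 1],
        [1, 0, 1],
        [1, 1, 1],
        [0, 0, 1],
        [0, 1, 1],
        [0, 0, 1],
    ]
    train_labels = [1, 1, 1, 0, 0, 0]
    print(rank_variants(train_genotypes, train_labels, top_k=2))
```

In a cross-study setting, the ranking would be computed on the training study only and the selected indices then applied unchanged to the independent study, so that the test data do not leak into feature selection.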

    Revolutionizing Genomic Instrumentation: Accelerated Base Calling With Deep Learning For Real-Time Precision

    As deep learning methods are increasingly used in the core base calling procedure of genomic instruments, their significance in the field of genomics has grown. Recurrent neural networks (RNNs) and convolutional neural networks (CNNs) are used in this paradigm shift to decode complex genetic data. The ability of these neural networks to decipher image and signal data produced by sequencing instruments allows the sequence of the 3 billion nucleotide pairs that make up the human genome to be inferred. The accuracy of sequencing reads is improved, and base calling can be completed more quickly after real-time data production, which has significant implications for genomics. This leads to a dramatic acceleration of the whole genomics workflow, from sample collection to the creation of Variant Call Format (VCF) files and final reports, ushering in a new age of speed and precision in genetic research.

    Knowledge Driven Approaches and Machine Learning Improve the Identification of Clinically Relevant Somatic Mutations in Cancer Genomics

    For cancer genomics to fully expand its utility from research discovery to clinical adoption, somatic variant detection pipelines must be optimized and standardized to ensure identification of clinically relevant mutations and to reduce laborious and error-prone post-processing steps. To address the need for improved catalogues of clinically and biologically important somatic mutations, we developed DoCM, a Database of Curated Mutations in Cancer (http://docm.info), as described in Chapter 2. DoCM is an open source, openly licensed resource to enable the cancer research community to aggregate, store and track biologically and clinically important cancer variants. DoCM currently comprises 1,364 variants in 132 genes across 122 cancer subtypes, based on the curation of 876 publications. To demonstrate the utility of this resource, the mutations in DoCM were used to identify variants of established significance in cancer that were missed by standard variant discovery pipelines (Chapter 3). Sequencing data from 1,833 cases across four TCGA projects were reanalyzed, and 1,228 putative variants that were missed in the original TCGA reports were identified. Validation sequencing data were produced from 93 of these cases to confirm the putative variants we detected with DoCM. Here, we demonstrated that at least one functionally important variant in DoCM was recovered in 41% of cases studied. A major bottleneck in the DoCM analysis in Chapter 3 was the filtering and manual review of somatic variants. Several steps in this post-processing phase of somatic variant calling have already been automated. However, false positive filtering and manual review of variant candidates remain a major challenge, especially in high-throughput discovery projects or in clinical cancer diagnostics.
Chapter 4 outlines an approach that systematizes and standardizes the post-processing of somatic variant calls using machine learning algorithms trained on 41,000 manually reviewed variants from 20 cancer genome projects. The approach accurately reproduced the manual review process on held-out test samples, and accurately predicted which variants would be confirmed by orthogonal validation sequencing data. When compared to traditional manual review, this approach increased identification of clinically actionable variants by 6.2%. These chapters outline studies that result in substantial improvements in the identification and interpretation of somatic variants, the use of which can standardize and streamline cancer genomics, enabling its use at high throughput as well as clinically.
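
The use of a curated list to recover variants that a standard pipeline did not report can be sketched as a set lookup. This is an illustration only and not the DoCM software or its API: the variant key `(chromosome, position, ref, alt)`, the function names, and all data below are hypothetical placeholders.

```python
def recover_missed(curated, candidates, reported):
    """Return candidate variants that appear in the curated set but were
    absent from the originally reported calls.

    Each variant is a hypothetical key: (chromosome, position, ref, alt).
    """
    return sorted((set(candidates) & set(curated)) - set(reported))


def fraction_of_cases_with_recovery(curated, cases):
    """Fraction of cases in which at least one curated variant is recovered.

    cases: dict mapping a case ID to {"candidates": [...], "reported": [...]}
    """
    if not cases:
        return 0.0
    hits = sum(
        1
        for calls in cases.values()
        if recover_missed(curated, calls["candidates"], calls["reported"])
    )
    return hits / len(cases)


if __name__ == "__main__":
    # Placeholder coordinates, not real genomic positions.
    curated = {("chrA", 100, "A", "T"), ("chrB", 200, "G", "C")}
    cases = {
        "case1": {
            "candidates": [("chrA", 100, "A", "T"), ("chrC", 5, "T", "G")],
            "reported": [],
        },
        "case2": {
            "candidates": [("chrB", 200, "G", "C")],
            "reported": [("chrB", 200, "G", "C")],
        },
    }
    print(fraction_of_cases_with_recovery(curated, cases))
```

A lookup like this only nominates putative variants; as the abstract notes, candidates still need filtering, review, and orthogonal validation before they are treated as confirmed.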