241 research outputs found

    Spatial Distribution of Disease-associated Variants in Three-dimensional Structures of Protein Complexes

    Integration of Random Forest Classifiers and Deep Convolutional Neural Networks for Classification and Biomolecular Modeling of Cancer Driver Mutations

    The development of machine learning solutions for predicting the functional and clinical significance of cancer driver genes and mutations is paramount in modern biomedical research and has gained significant momentum in the past decade. In this work, we integrate different machine learning approaches, including tree-based random forest and gradient boosted tree (GBT) classifiers along with deep convolutional neural networks (CNN), for prediction of cancer driver mutations in genomic datasets. The feasibility of using CNNs on raw nucleotide sequences for classification of cancer driver mutations was initially explored by employing label encoding, one-hot encoding, and embedding to preprocess the DNA information. These classifiers were benchmarked against their tree-based alternatives in order to evaluate their performance on a relative scale. We then integrated DNA-based scores generated by the CNN with various categories of conservation, evolutionary, and functional features into a generalized random forest classifier. The results of this study demonstrate that CNNs can learn high-level features from genomic information that are complementary to the ensemble-based predictors often employed for classification of cancer mutations. By combining the deep learning-generated score with only two main ensemble-based functional features, we achieve superior performance across various machine learning classifiers. Our findings also suggest that the synergy of nucleotide-based deep learning scores and integrated metrics derived from protein sequence conservation scores can allow for robust classification of cancer driver mutations with a limited number of highly informative features.
Machine learning predictions are leveraged in molecular simulations, protein stability analysis, and network-based analysis of cancer mutations in the protein kinase genes to obtain insights about molecular signatures of driver mutations and to enhance the interpretability of cancer-specific classification models.
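
The abstract above mentions label encoding and one-hot encoding as ways to preprocess raw DNA before a CNN. The following is a minimal sketch of those two encodings, not the authors' pipeline: the alphabet order, the function names, and the treatment of unknown bases such as N are all assumptions made for illustration.

```python
# Illustrative sketch only. The alphabet order and the handling of
# unknown bases (mapped to -1 / all zeros) are assumptions, not taken
# from the abstract.
ALPHABET = "ACGT"


def label_encode(seq):
    """Map each base to an integer index; unknown bases (e.g. N) map to -1."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    return [index.get(base, -1) for base in seq.upper()]


def one_hot_encode(seq):
    """Map each base to a 4-element indicator vector; unknown bases become all zeros."""
    rows = []
    for label in label_encode(seq):
        row = [0] * len(ALPHABET)
        if label >= 0:
            row[label] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    print(label_encode("ACGTN"))
    print(one_hot_encode("ACGTN"))
```

Label encoding imposes an artificial ordering on the bases, whereas one-hot encoding treats them as unordered categories, which is one plausible reason to compare the two.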

    Understanding oncogenicity of cancer driver genes and mutations in the cancer genomics era

    One of the key challenges of cancer biology is to catalogue and understand the somatic genomic alterations leading to cancer. Although alternative definitions and search methods have been developed to identify cancer driver genes and mutations, analyses of thousands of cancer genomes return a remarkably similar catalogue of around 300 genes that are mutated in at least one cancer type. Yet, many features of these genes and their role in cancer remain unclear, first and foremost when a somatic mutation is truly oncogenic. In this review, we first summarize some of the recent efforts in completing the catalogue of cancer driver genes. Then, we give an overview of different aspects that influence the oncogenicity of somatic mutations in the core cancer driver genes, including their interactions with the germline genome, other cancer driver mutations, the immune system, or their potential role in healthy tissues. In the coming years, this research holds promise to illuminate how, when, and why cancer driver genes and mutations are really drivers, and thereby move personalized cancer medicine and targeted therapies forward

    COMPUTATIONAL METHODS IN MISSENSE MUTATION ANALYSIS: PHENOTYPES, PATHOGENICITY, AND PROTEIN ENGINEERING

    Understanding the molecular, phenotypic, and pathogenic effects of mutations is of enormous importance in human disease research and protein engineering. Both create a demand for computational methods to leverage the explosion of new sequence data and to explore the vast space of possible protein modifications and designs. My study in this dissertation demonstrates the value of computational methods in these areas. First, I developed a new ensemble method to predict continuous phenotype values as well as binary pathogenicity and objectively tested it in CAGI (Critical Assessment of Genome Interpretation). In two recent CAGI challenges, the method was ranked third in predicting the enzyme activity of missense mutations in NAGLU (N-Acetyl-Alpha-Glucosaminidase) and second in predicting the relative growth rate of mutated human SUMO-ligase in a yeast complementation assay. I also demonstrated the effectiveness of the new ensemble method for addressing a key problem limiting the use of current mutation interpretation methods in the clinic – identifying which mutations can be assigned a pathogenic or benign status with high confidence. Next, I characterized and compared missense variants in monogenic disease and in cancer. The study revealed a number of properties of mutations in these two types of diseases, including: (a) methods based on sequence conservation properties are as effective for identifying cancer driver mutations as they are for monogenic disease mutations; (b) mutations in disordered regions of protein structure play a relatively small role in both classes of disease; (c) oncogenic mutations tend to be on the protein surface while tumor suppressors are concentrated in the core; (d) a large fraction of tumor suppressors act by destabilizing protein structure and (e) mutations in passenger genes display a surprisingly high level of deleteriousness. 
Finally, I applied computational methods to screen for appropriate mutations to enhance the thermostability of a catalytic domain of PlyC. This bacteriophage-derived endolysin has been demonstrated to have antimicrobial potential, but its use is limited by its inherent thermosusceptibility. To prioritize stabilizing mutations, I conducted a rapid exhaustive survey of point mutations followed by validation using protein modeling and expert knowledge. The approach yielded three stabilizing mutants experimentally verified by our collaborators, with one particularly successful in terms of both thermal denaturation temperature and kinetic stability.

    Mechanism of activation and the rewired network: New drug design concepts

    Precision oncology benefits from effective early phase drug discovery decisions. Recently, drugging inactive protein conformations has shown impressive successes, raising the cardinal questions of which targets can profit and what are the principles of the active/inactive protein pharmacology. Cancer driver mutations have been established to mimic the protein activation mechanism. We suggest that the decision whether to target an inactive (or active) conformation should largely rest on the protein mechanism of activation. We next discuss the recent identification of double (multiple) same-allele driver mutations and their impact on cell proliferation and suggest that like single driver mutations, double drivers also mimic the mechanism of activation. We further suggest that the structural perturbations of double (multiple) in cis mutations may reveal new surfaces/pockets for drug design. Finally, we underscore the preeminent role of the cellular network which is deregulated in cancer. Our structure-based review and outlook updates the traditional Mechanism of Action, informs decisions, and calls attention to the intrinsic activation mechanism of the target protein and the rewired tumor-specific network, ushering innovative considerations in precision medicine

    Protein Structure-Guided Approaches to Identify Functional Mutations in Cancer

    Distinguishing driver mutations from passenger mutations within tumor cells continues to be a major challenge in cancer genomics. Many computational tools have been developed to address this challenge; however, they rely heavily on primary protein sequence context and frequency/mutation rate. Rare driver mutations not found in many cancer patients may be missed with these traditional approaches. Additionally, the structural context of mutations on tertiary/quaternary protein structures is not taken into account, although it may play a more prominent role in determining phenotype and function. This dissertation first presents a novel computational tool called HotSpot3D, which identifies regions of protein structures that are enriched in proximal mutations from cancer patients and identifies clusters of mutations within a single protein as well as along the interface of protein-protein complexes. This tool gives insight into potential rare driver mutations that may cluster closely to known hotspot driver mutations, as well as critical regions of proteins specific to certain cancer types. A small subset of predictions from this tool is validated using high-throughput phosphorylation data and an in vitro cell-based assay to support its biological utility. We then shift to studying the druggability of mutations and apply HotSpot3D to identify potential druggable mutations that cluster with known sensitive actionable mutations. We also demonstrate how utilizing integrative omics approaches better enables precision oncology; combining multiple data types such as genomic mutations or mRNA/protein expression outliers as biomarkers of druggability can expand the druggable cohort, better inform treatment response, and nominate novel combinatorial therapies for clinical trials. Lastly, we improve driver predictions of HotSpot3D by creating a supervised learning approach that integrates additional biological features related to structural context beyond just positional clustering.
Overall, this dissertation provides a suite of computational methods to explore mutations in the context of protein structure and their potential implications in oncogenesis
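
The idea of grouping mutations that lie close together in a 3D structure can be sketched as below. This is a generic single-linkage clustering with a distance cutoff, not HotSpot3D's actual algorithm; the function name, the 10 Å default cutoff, and the residue labels and coordinates in the usage example are hypothetical.

```python
import math
from itertools import combinations


def cluster_by_proximity(coords, cutoff=10.0):
    """Group residues by single linkage: two residues share a cluster if a
    chain of pairwise distances <= cutoff connects them.

    coords maps a residue label to an (x, y, z) tuple. The cutoff value is
    an assumption chosen for illustration.
    """
    parent = {name: name for name in coords}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(coords, 2):
        if math.dist(coords[a], coords[b]) <= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for name in coords:
        clusters.setdefault(find(name), set()).add(name)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: (-len(c), c))


if __name__ == "__main__":
    # Hypothetical residue labels and coordinates.
    example = {"R1": (0, 0, 0), "R2": (5, 0, 0), "R3": (12, 0, 0), "R4": (50, 50, 50)}
    print(cluster_by_proximity(example))
```

A real tool would also need to weigh mutation recurrence and assess statistical significance; this sketch covers only the geometric grouping step.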

    Analysis of whole-genome sequencing data from ICGC-PanCancer project

    Cancer is one of the greatest health challenges of the 21st century and one of the deadliest diseases in the world. It is a group of different diseases that are caused by abnormal cell growth. In the human body, cell division and apoptosis are well regulated under normal circumstances so that the number of cells is in a dynamic balance. However, normal cells can transform into tumor cells because of genetic mutations. Tumorigenesis can happen in almost any cell of the human body. One of the central tools to address cancer is the profiling of cancer cell genomes and transcriptomes by next generation sequencing (NGS) and subsequent analysis by computational methods. The Pan-Cancer Analysis of Whole Genomes (PCAWG) project is the core project of the International Cancer Genome Consortium. This project provides massive amounts of cancer biological data for analysis, including more than 2900 patients and 48 types of cancer samples. As part of this intensive effort, I have conducted a very detailed analysis of the molecular mechanisms of cancers. In particular, I conducted a comprehensive study of the relationship between genomic mutations and cancer development. This series of studies includes the exploration of cancer driver genes, analysis of telomere maintenance mechanisms, and data visualization at the cohort level. First, I explored potential cancer genes by performing statistical analysis of genomic point mutations, insertions and deletions, copy number variations, and structural variations. Further, I analyzed the distribution of point mutations and structural variations in cancer genomes. Based on Knudson's two-hit hypothesis, I integrated point mutation and copy number variation information to construct a biallelic inactivation map of the cancer genome.
With the biallelic inactivation information, I analyzed potential cancer drivers and applied this finding to synthetic lethality assays associated with cancer driver genes to uncover novel genetic targets that could be used to treat cancer patients with certain driver gene defects. In addition, I designed and improved the CaSINo model to score the relative mutation frequency of chromosomal sequences to screen for potential cancer driver mutations, which can be used not only in coding genes but also in non-coding regions. Moreover, I analyzed point mutations in promoters, trying to find those mutation sites that play a key role in the up-regulation of gene expression. Finally, I designed and improved a scoring method for copy number variation focality to explore the association of focal copy number variation with cancer driver genes at the cohort level. Second, as part of the PCAWG research projects, I analyzed the mechanisms of telomere maintenance in cancer cells. After analyzing the differences between alternative lengthening of telomeres (ALT) and telomerase-positive samples, I designed a machine learning model based on repeat sequences, content, and mutation rate to determine whether an unknown cancer sample is ALT or telomerase-positive. Finally, for the massive data of the PCAWG project, I designed and implemented two bioinformatics visualization tools. TumorPrint is software written in R and shell, which can be used to visualize genomic mutations and RNA-seq expression levels of a single gene or gene pairs, allowing users to quickly search for genes or gene pairs of interest. GenomeTornadoPlot is software written in R for visualizing focal copy number variants of a single gene or adjacent paired genes, and it can automatically calculate the copy number variation aggregation score.

    Structure-based predictions broadly link transcription factor mutations to gene expression changes in cancers

    © 2014 The Author(s). Thousands of unique mutations in transcription factors (TFs) arise in cancers, and the functional and biological roles of relatively few of these have been characterized. Here, we used structure-based methods developed specifically for DNA-binding proteins to systematically predict the consequences of mutations in several TFs that are frequently mutated in cancers. The explicit consideration of protein-DNA interactions was crucial to explain the roles and prevalence of mutations in TP53 and RUNX1 in cancers, and resulted in a higher specificity of detection for known p53-regulated genes among genetic associations between TP53 genotypes and genome-wide expression in The Cancer Genome Atlas, compared to existing methods of mutation assessment. Biophysical predictions also indicated that the relative prevalence of TP53 missense mutations in cancer is proportional to their thermodynamic impacts on protein stability and DNA binding, which is consistent with the selection for the loss of p53 transcriptional function in cancers. Structure and thermodynamics-based predictions of the impacts of missense mutations that focus on specific molecular functions may be increasingly useful for the precise and large-scale inference of aberrant molecular phenotypes in cancer and other complex diseases