Integration of Random Forest Classifiers and Deep Convolutional Neural Networks for Classification and Biomolecular Modeling of Cancer Driver Mutations
The development of machine learning solutions for predicting the functional and clinical significance of cancer driver genes and mutations is paramount in modern biomedical research and has gained significant momentum in the past decade. In this work, we integrate different machine learning approaches, including tree-based methods (random forest and gradient boosted tree (GBT) classifiers) along with deep convolutional neural networks (CNN), for prediction of cancer driver mutations in genomic datasets. The feasibility of using CNN on raw nucleotide sequences for classification of cancer driver mutations was initially explored by employing label encoding, one-hot encoding, and embedding to preprocess the DNA information. These classifiers were benchmarked against their tree-based alternatives in order to evaluate their performance on a relative scale. We then integrated DNA-based scores generated by CNN with various categories of conservational, evolutionary and functional features into a generalized random forest classifier. The results of this study demonstrate that CNN can learn high-level features from genomic information that are complementary to the ensemble-based predictors often employed for classification of cancer mutations. By combining the deep learning-generated score with only two main ensemble-based functional features, we can achieve superior performance across various machine learning classifiers. Our findings also suggest that the synergy of nucleotide-based deep learning scores and integrated metrics derived from protein sequence conservation scores can allow for robust classification of cancer driver mutations with a limited number of highly informative features.
Machine learning predictions are leveraged in molecular simulations, protein stability, and network-based analysis of cancer mutations in the protein kinase genes to obtain insights about molecular signatures of driver mutations and enhance the interpretability of cancer-specific classification models
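The label and one-hot encodings named in the abstract above can be illustrated with a minimal sketch. The function names, the `ACGT` ordering, and the handling of unknown bases are assumptions made for illustration; they are not taken from the paper. The third option, an embedding, would need a learned lookup table and is not shown.

```python
# Hypothetical helpers; the alphabet order and the treatment of
# unknown bases (such as N) are assumptions, not the paper's choices.
NUCLEOTIDES = "ACGT"


def label_encode(seq):
    """Map each nucleotide to an integer index; unknown bases map to -1."""
    return [NUCLEOTIDES.find(base) for base in seq.upper()]


def one_hot_encode(seq):
    """Map each nucleotide to a 4-element indicator vector.

    Unknown bases become an all-zero row.
    """
    rows = []
    for idx in label_encode(seq):
        row = [0, 0, 0, 0]
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows
```

A one-hot matrix of shape (sequence length, 4) is the form a convolutional layer would typically consume, whereas the integer labels would more naturally feed an embedding layer.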
Understanding oncogenicity of cancer driver genes and mutations in the cancer genomics era
One of the key challenges of cancer biology is to catalogue and understand the somatic genomic alterations leading to cancer. Although alternative definitions and search methods have been developed to identify cancer driver genes and mutations, analyses of thousands of cancer genomes return a remarkably similar catalogue of around 300 genes that are mutated in at least one cancer type. Yet, many features of these genes and their role in cancer remain unclear, first and foremost when a somatic mutation is truly oncogenic. In this review, we first summarize some of the recent efforts in completing the catalogue of cancer driver genes. Then, we give an overview of different aspects that influence the oncogenicity of somatic mutations in the core cancer driver genes, including their interactions with the germline genome, other cancer driver mutations, the immune system, or their potential role in healthy tissues. In the coming years, this research holds promise to illuminate how, when, and why cancer driver genes and mutations are really drivers, and thereby move personalized cancer medicine and targeted therapies forward
COMPUTATIONAL METHODS IN MISSENSE MUTATION ANALYSIS: PHENOTYPES, PATHOGENICITY, AND PROTEIN ENGINEERING
Understanding the molecular, phenotypic, and pathogenic effects of mutations is of enormous importance in human disease research and protein engineering. Both create a demand for computational methods to leverage the explosion of new sequence data and to explore the vast space of possible protein modifications and designs. My study in this dissertation demonstrates the value of computational methods in these areas. First, I developed a new ensemble method to predict continuous phenotype values as well as binary pathogenicity and objectively tested it in CAGI (Critical Assessment of Genome Interpretation). In two recent CAGI challenges, the method was ranked third in predicting the enzyme activity of missense mutations in NAGLU (N-Acetyl-Alpha-Glucosaminidase) and second in predicting the relative growth rate of mutated human SUMO-ligase in a yeast complementation assay. I also demonstrated the effectiveness of the new ensemble method for addressing a key problem limiting the use of current mutation interpretation methods in the clinic: identifying which mutations can be assigned a pathogenic or benign status with high confidence. Next, I characterized and compared missense variants in monogenic disease and in cancer. The study revealed a number of properties of mutations in these two types of diseases, including: (a) methods based on sequence conservation properties are as effective for identifying cancer driver mutations as they are for monogenic disease mutations; (b) mutations in disordered regions of protein structure play a relatively small role in both classes of disease; (c) oncogenic mutations tend to be on the protein surface while tumor suppressor mutations are concentrated in the core; (d) a large fraction of tumor suppressor mutations act by destabilizing protein structure; and (e) mutations in passenger genes display a surprisingly high level of deleteriousness.
Finally, I applied computational methods to screen for appropriate mutations to enhance the thermostability of a catalytic domain of PlyC. This bacteriophage-derived endolysin has been demonstrated to have antimicrobial potential, but its use is limited by its inherent thermosusceptibility. To prioritize stabilizing mutations, I conducted a rapid exhaustive survey of point mutations followed by validation using protein modeling and expert knowledge. The approach yielded three stabilizing mutants experimentally verified by our collaborators, with one particularly successful in terms of both thermal denaturation temperature and kinetic stability.
Mechanism of activation and the rewired network: New drug design concepts
Precision oncology benefits from effective early phase drug discovery decisions. Recently, drugging inactive protein conformations has shown impressive successes, raising the cardinal questions of which targets can profit and what the principles of active/inactive protein pharmacology are. Cancer driver mutations have been established to mimic the protein activation mechanism. We suggest that the decision whether to target an inactive (or active) conformation should largely rest on the protein mechanism of activation. We next discuss the recent identification of double (multiple) same-allele driver mutations and their impact on cell proliferation and suggest that, like single driver mutations, double drivers also mimic the mechanism of activation. We further suggest that the structural perturbations of double (multiple) in cis mutations may reveal new surfaces/pockets for drug design. Finally, we underscore the preeminent role of the cellular network, which is deregulated in cancer. Our structure-based review and outlook update the traditional Mechanism of Action, inform decisions, and call attention to the intrinsic activation mechanism of the target protein and the rewired tumor-specific network, ushering in innovative considerations in precision medicine.
Identifying driver mutations in cancers
All cancers depend upon mutations in critical genes, which confer a selective advantage to the tumour cell. The key to understanding the contribution of a disease-associated mutation to the development and progression of cancer comes from an understanding of the consequences of that mutation on the function of the affected protein, and the impact on the pathways in which that protein is involved.
Using data from over 30 different cancers from whole-exome sequencing cancer genomic projects, I analysed over one million somatic mutations. I identified mutational hotspots within domain families by mapping small mutations to equivalent positions in multiple sequence alignments of protein domains. I found that gain of function mutations from oncogenes and loss of function mutations from tumour suppressors are normally found in different domain families and when observed in the same domain families, hotspot mutations are located at different positions within the multiple sequence alignment of the domain.
Next, I investigated the ability of seven prediction algorithms to discriminate between driver missense mutations in oncogenes and tumour suppressors. Using 19 features to describe these mutations, I then developed a random forest classifier, MOKCaRF, to distinguish between gain of function and loss of function missense mutations in cancer. MOKCaRF performs significantly better than existing algorithms.
I then evaluated the ability of six existing prediction tools to distinguish between pathogenic and neutral mutations for both inframe insertion and inframe deletion mutations. I developed my own classifiers using 11 features that perform better than the current algorithms.
Finally, using the algorithms that I developed, as well as changes in copy number and expression data for each gene, I analysed samples from 50 lung cancer patients to identify the actionable targets and potential new drug targets for each tumour
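The hotspot step in this abstract, mapping mutations to equivalent positions in a multiple sequence alignment of a domain family, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `-` gap character, and the input layout are hypothetical and are not the thesis's actual pipeline.

```python
from collections import Counter


def residue_to_column(aligned_seq, gap="-"):
    """Map 1-based ungapped residue positions to 0-based alignment columns."""
    mapping = {}
    pos = 0
    for col, ch in enumerate(aligned_seq):
        if ch != gap:
            pos += 1
            mapping[pos] = col
    return mapping


def column_counts(alignment, mutations):
    """Count mutations per alignment column.

    alignment: dict of protein name -> aligned sequence (hypothetical input).
    mutations: list of (protein name, 1-based residue position) pairs.
    Mutations outside the aligned region are skipped.
    """
    maps = {name: residue_to_column(seq) for name, seq in alignment.items()}
    counts = Counter()
    for name, pos in mutations:
        col = maps[name].get(pos)
        if col is not None:
            counts[col] += 1
    return counts
```

Columns with unusually high counts across the family would be candidate hotspots. Deciding what counts as "unusually high" needs a statistical test that this sketch does not attempt.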
Protein Structure-Guided Approaches to Identify Functional Mutations in Cancer
Distinguishing driver mutations from passenger mutations within tumor cells continues to be a major challenge in cancer genomics. Many computational tools have been developed to address this challenge; however, they rely heavily on primary protein sequence context and frequency/mutation rate. Rare driver mutations not found in many cancer patients may be missed with these traditional approaches. Additionally, the structural context of mutations on tertiary/quaternary protein structures is not taken into account and may play a more prominent role in determining phenotype and function. This dissertation first presents a novel computational tool called HotSpot3D, which identifies regions of protein structures that are enriched in proximal mutations from cancer patients and identifies clusters of mutations within a single protein as well as along the interface of protein-protein complexes. This tool gives insight into potential rare driver mutations that may cluster closely to known hotspot driver mutations as well as critical regions of proteins specific to certain cancer types. A small subset of predictions from this tool is validated using high-throughput phosphorylation data and an in vitro cell-based assay to support its biological utility. We then shift to studying the druggability of mutations and apply HotSpot3D to identify potential druggable mutations that cluster with known sensitive actionable mutations. We also demonstrate how utilizing integrative omics approaches better enables precision oncology: combining multiple data types such as genomic mutations or mRNA/protein expression outliers as biomarkers of druggability can expand the druggable cohort, better inform treatment response, and nominate novel combinatorial therapies for clinical trials. Lastly, we improve the driver predictions of HotSpot3D by creating a supervised learning approach that integrates additional biological features related to structural context beyond just positional clustering.
Overall, this dissertation provides a suite of computational methods to explore mutations in the context of protein structure and their potential implications in oncogenesis
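The idea of finding mutations that lie close together on a protein structure can be illustrated with a simple distance-cutoff clustering. This is a generic sketch, not HotSpot3D's actual algorithm: the function name, the 10 Å default cutoff, and the coordinate input are all assumptions.

```python
import math


def proximity_clusters(coords, cutoff=10.0):
    """Group mutated residues into clusters by spatial proximity.

    coords: dict of residue label -> (x, y, z) coordinates (hypothetical input).
    Two residues are linked if their Euclidean distance is <= cutoff;
    clusters are the connected components of that link graph.
    """
    labels = sorted(coords)
    neighbours = {a: set() for a in labels}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if math.dist(coords[a], coords[b]) <= cutoff:
                neighbours[a].add(b)
                neighbours[b].add(a)

    seen, clusters = set(), []
    for start in labels:
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        clusters.append(sorted(component))
    return clusters
```

A rare mutation that falls in the same cluster as a known hotspot would, under this simplified view, be a candidate for follow-up.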
Analysis of whole-genome sequencing data from ICGC-PanCancer project
Cancer is one of the greatest health challenges of the 21st century and one of the deadliest diseases in the world. It is a group of different diseases caused by abnormal cell growth. In the human body, cell division and apoptosis are well regulated under normal circumstances so that the number of cells is in a dynamic balance. However, normal cells can transform into tumor cells because of genetic mutations. Tumorigenesis can occur in almost any cell of the human body. One of the central tools to address cancer is the profiling of cancer cell genomes and transcriptomes by next generation sequencing (NGS) and subsequent analysis by computational methods.
The Pan-Cancer Analysis of Whole Genomes (PCAWG) project is the core project of the International Cancer Genome Consortium. This project provides massive amounts of cancer biological data for analysis, including more than 2900 patients and 48 types of cancer samples. As part of this intensive effort, I have conducted a very detailed analysis of the molecular mechanisms of cancers. In particular, I conducted a comprehensive study of the relationship between genomic mutations and cancer development. This series of studies includes the exploration of cancer driver genes, analysis of telomere maintenance mechanisms and data visualization at the cohort level.
First, I explored potential cancer genes by performing statistical analysis of genomic point mutations, insertions and deletions, copy number variations and structural variations. Further, I analyzed the distribution of point mutations and structural variations in cancer genomes. Based on Knudson's two-hit hypothesis, I integrated point mutation and copy number variation information to construct a biallelic inactivation map of the cancer genome. With the biallelic inactivation information, I analyzed potential cancer drivers and applied this finding to synthetic lethality assays associated with cancer driver genes to uncover novel genetic targets that could be used to treat cancer patients with certain driver gene defects. In addition, I designed and improved the CaSINo model to score the relative mutation frequency of chromosomal sequences to screen for potential cancer driver mutations, which can be used not only in coding genes but also in non-coding regions. Moreover, I analyzed point mutations in promoters, trying to find those mutation sites that play a key role in the up-regulation of gene expression. Finally, I designed and improved a scoring method for copy number variation focality to explore the association of focal copy number variation with cancer driver genes at the cohort level.
Second, as part of the PCAWG research projects, I analyzed the mechanisms of telomere maintenance in cancer cells. After analyzing the differences between alternative lengthening of telomeres (ALT) samples and telomerase-positive samples, I designed a machine learning model based on repeat sequences, content, and mutation rate to determine whether an unknown cancer sample is ALT or telomerase-positive.
Finally, for the massive data of the PCAWG project, I designed and implemented two bioinformatics visualization tools. TumorPrint is software written in R and shell that visualizes genomic mutations and RNA-seq expression levels of a single gene or gene pairs, allowing users to quickly search for genes or gene pairs of interest. GenomeTornadoPlot is software written in R for visualizing focal copy number variants of a single gene or adjacent paired genes; it can automatically calculate the copy number variation aggregation score.
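The biallelic inactivation map mentioned in this abstract combines point mutation and copy number information under Knudson's two-hit hypothesis. A minimal sketch of that integration is shown below. The function name and input layout are hypothetical, and this sketch only covers the case of one damaging point mutation plus loss of the other copy; a real analysis would likely also consider other two-hit combinations.

```python
def biallelic_inactivation(point_mutations, copy_number_losses):
    """Find genes hit on both alleles within the same sample.

    point_mutations: set of (sample, gene) pairs with a damaging point mutation.
    copy_number_losses: set of (sample, gene) pairs with loss of one copy.
    Returns a dict of gene -> sorted list of samples where both hits co-occur.
    """
    result = {}
    # Set intersection keeps only (sample, gene) pairs present in both inputs.
    for sample, gene in sorted(point_mutations & copy_number_losses):
        result.setdefault(gene, []).append(sample)
    return result
```

Genes that recur across many samples in the returned dictionary would be the ones to examine further as candidate tumour suppressors.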
Structure-based predictions broadly link transcription factor mutations to gene expression changes in cancers
© 2014 The Author(s). Thousands of unique mutations in transcription factors (TFs) arise in cancers, and the functional and biological roles of relatively few of these have been characterized. Here, we used structure-based methods developed specifically for DNA-binding proteins to systematically predict the consequences of mutations in several TFs that are frequently mutated in cancers. The explicit consideration of protein-DNA interactions was crucial to explain the roles and prevalence of mutations in TP53 and RUNX1 in cancers, and resulted in a higher specificity of detection for known p53-regulated genes among genetic associations between TP53 genotypes and genome-wide expression in The Cancer Genome Atlas, compared to existing methods of mutation assessment. Biophysical predictions also indicated that the relative prevalence of TP53 missense mutations in cancer is proportional to their thermodynamic impacts on protein stability and DNA binding, which is consistent with the selection for the loss of p53 transcriptional function in cancers. Structure and thermodynamics-based predictions of the impacts of missense mutations that focus on specific molecular functions may be increasingly useful for the precise and large-scale inference of aberrant molecular phenotypes in cancer and other complex diseases