123 research outputs found

    Integrative Genomics Approaches to Identify Genetic Drivers of Human Breast Cancer

    Breast cancer is a heterogeneous disease, as revealed by next-generation sequencing studies. Intrinsic molecular subtypes defined by gene expression profiles have been extensively validated in both research and clinical settings, with distinct genetic, epigenetic and transcriptomic characteristics within each subtype. Breast cancer has been shown to be driven by multiple types of alterations, including somatic mutations and copy number changes. Identifying genetic drivers of breast cancer against the background of massive passenger alterations is critical to understanding the tumor biology underlying heterogeneity and to facilitating personalized treatment decisions. Integrating multi-platform genomic data is likely to aid in the identification of genetic drivers of tumors through comprehensive molecular characterization. In this thesis, we used multi-platform genomic data from publicly available datasets on cancer patients and cell lines, as well as from a new cohort of breast cancer patients, to characterize many aspects of tumor biology and identify potential genetic drivers through integrative analyses. Using an Elastic Net statistical modeling approach, we demonstrate that copy number alterations (CNAs) drive key gene expression phenotypes, protein expression and clinical features, with these phenotypes being accurately predicted using DNA CNAs only. We performed DNA and RNA sequencing on a set of estrogen receptor-positive breast cancer patients receiving endocrine therapy and identified genetic alterations and transcriptomic profiles specific to resistant tumors compared to sensitive tumors. Finally, we analyzed the cancer dependency map revealed by large-scale loss-of-function screens in cancer cell lines and provide new insights on driver identification. In summary, this work sought to address the critical problem of cancer drivers from multiple perspectives. 
We show the power of integrative analysis to understand the genetic causes underlying tumor behavior, identifying known and novel alterations driving tumor progression and drug resistance. The findings presented here highlight the genetic diversity of breast cancer between and within intrinsic molecular subtypes and the ability to utilize multiple data types to elucidate this heterogeneity. Doctor of Philosophy
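As a rough illustration of the Elastic Net idea mentioned in this abstract (predicting a phenotype from copy number features with a combined L1 and L2 penalty), the following is a minimal sketch. It is not the thesis's pipeline: the coordinate-descent solver, the penalty settings and the five-sample toy data are all assumptions made for illustration, and the objective follows the common (1/2n) least-squares convention.

```python
# Toy elastic-net regression by coordinate descent (pure Python).
# All data below are synthetic; nothing here comes from the thesis.


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, alpha=0.5, l1_ratio=0.5, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||^2, assuming centred X and y."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlate feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
    return w


# Hypothetical centred copy number values for two segments in five tumors.
# Column 0 drives the phenotype; column 1 is unrelated to it.
X = [[-2, 1], [-1, -1], [0, 0], [1, -1], [2, 1]]
y = [-6, -3, 0, 3, 6]  # synthetic expression phenotype, equal to 3 * column 0

weights = elastic_net(X, y)
print(weights)
```

In this toy setting the penalty shrinks the weight of the informative segment slightly below 3 and sets the weight of the unrelated segment to zero, which is the sparsity behaviour that makes such models convenient for picking candidate driver regions.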

    Large-Scale and Pan-Cancer Multi-omic Analyses with Machine Learning

    Multi-omic data analysis has been foundational in many fields of molecular biology, including cancer research. Investigating the relationships between different omic data types reveals patterns that cannot be found in a single data type alone. With recent technological advancements in mass spectrometry (MS), MS-based proteomics has enabled the quantification of thousands of proteins in hundreds of cell lines and human tissue samples. This thesis presents several machine learning-based methods that facilitate the integrative analysis of multi-omic data. First, we reviewed five existing multi-omic data integration methods and performed a benchmarking analysis using a large-scale multi-omic cancer cell line dataset. We evaluated the performance of these machine learning methods for drug response prediction and cancer type classification. Our results provide recommendations to researchers on selecting the optimal machine learning method for their applications. Second, we generated a pan-cancer proteomic map of 949 cancer cell lines across 40 cancer types and developed a machine learning method, DeeProM, to analyse the multi-omic information of these lines. This pan-cancer proteomic map (ProCan-DepMapSanger) is now publicly available and represents a major resource for the scientific community, for biomarker discovery and for the study of fundamental aspects of protein regulation. Third, we focused on publicly available multi-omic datasets of both cancer cell lines and human tissue samples and developed a Transformer-based deep learning method, DeePathNet, which integrates human knowledge with machine intelligence. We applied DeePathNet to three evaluation tasks, namely drug response prediction, cancer type classification and breast cancer subtype classification. Taken together, our analyses and methods allowed more accurate cancer diagnosis and prognosis.

    Computational integration of genome-wide observational and functional data in cancer

    The emergence of high-throughput technologies is enabling the characterization of cancer genomes at unprecedented resolution and scale. However, such data suffer from the typical limitations of observational studies, which are frequently challenged by their inability to differentiate between causality and correlation. Recently, several datasets of genome-wide functional assays performed on tumor cell lines have become available. Given the ability of these assays to interrogate cancer genomes for the function of each individual gene, these data can provide vital cues to identify causal events and, with them, novel drug targets. Unfortunately, current analytical methods have been unable to overcome the challenges posed by these assays, which include a poor signal-to-noise ratio and widespread off-target effects. Given the largely orthogonal strengths and weaknesses of descriptive analysis of genetic and genomic observational data from cancer genomes and of genome-wide functional screening, I hypothesized that integrating the two data types into unified computational models would significantly increase the power of the biological analysis. In this dissertation I use integrative approaches to tackle two crucial problems in cancer research: the identification of driver genes and the discovery of tumor lethalities. I use the resulting methods to study breast cancer, the second most common form of this disease. The first part of the dissertation focuses on the analysis of regions of copy number alteration for the identification of driver genes. I first describe how a simple integrative method enabled the identification of BIN3, a novel driver of metastasis in breast cancer. I then describe Helios, an unsupervised method for the identification of driver genes in regions of SCNA that integrates different data sources into a single probabilistic score. 
Applying Helios to breast cancer data identified a set of candidate drivers highly enriched with known drivers (p-value < e-14). In vitro validation of 12 novel candidates predicted by Helios found that 10 conferred enhanced anchorage-independent growth, demonstrating Helios's exquisite sensitivity and specificity. I further provide an extensive characterization of RSF-1, a driver identified by Helios whose amplification correlates with poor prognosis, which displayed increased tumorigenesis and metastasis in mouse models. The second part of this dissertation addresses the problem of identifying tumor vulnerabilities using genome-wide shRNA screens across tumor cell lines. I approach this endeavor using a novel integrative method that employs different biomarkers of cellular state to facilitate the identification of clusters of hairpins with similar phenotypes. When applied to breast cancer data, the method not only recapitulates the main subtypes and lethalities associated with this malignancy, but also identifies several novel putative lethalities. Taken together, this research demonstrates the importance of the computational integration of genome-wide functional and observational data in cancer research, providing novel approaches that yield important insights into the biology of the disease.

    A Multi-omic Precision Oncology Pipeline to Elucidate Mechanistic Determinants of Cancer

    Despite decades of effort, the mechanistic underpinnings of many cancers remain unsolved. It has increasingly become appreciated that cancers can be more readily classified by their transcriptional identities than by genomics alone. A fuller understanding of the mechanistic connections between aberrant genomics and the transcriptional dysregulation of tumors is key both to improving our knowledge of cancer biology and to developing more precise and effective therapeutics. This thesis explores the development and application of a network-based multi-omic master regulator framework designed to elucidate these pathways. In Chapter 2 we apply this analysis across 20 tumor types from the Cancer Genome Atlas and in doing so identify 407 key master regulators responsible for canalizing a high percentage of the driver genetics present across these samples. Further evaluation of these key regulators revealed a highly modular structure, indicating that the regulators work in coordinated groups to implement a variety of key cancer hallmarks. Genetic and pharmacological validation assays confirmed the predicted interactions and biological phenotypes. Chapter 3 focuses on the application of this analytical framework specifically to gastroesophageal tumors. Using a more fine-grained approach, we find 15 distinct subtypes across a cohort of these heterogeneous tumors. These subtypes align well with previously identified features of these cancers but also reveal novel genomic associations and key master regulators that can serve as potential avenues for therapeutic treatment.

    Discovery of tissue specific network properties associated with cancer driver genes

    Master's thesis in Biochemistry, Faculdade de Ciências, Universidade de Lisboa, 2022. Using the notion of disease modules, network medicine has effectively identified disease-associated genes in recent years. In biological networks, genes linked to a particular illness tend to interact closely [1]. These networks allow both physical and functional connections between biomolecules to be identified, resulting in a map of the cell components and processes that constitute biological systems [2]. Not all disease-associated genes, however, have a major impact on the disease phenotype. The discovery of important genes able to produce or change the disease phenotype paves the way for new therapies and a personalized medicine strategy. Recent research has found that the topological features of a biological network may by themselves predict perturbation effects in a dynamical model of the system with 65-80% accuracy [3, 4]. Biological networks differ depending on the tissue or cell type being studied. As a result, each gene's topological features and ability to impact the system may change [5]. The main goal of this thesis is to discover network topological parameters associated with influential cancer driver genes using context-specific networks. To achieve this, we evaluated local network features around each driver gene across multiple tissue-specific networks, including tissues that are affected in the disease and others where the gene perturbation has no significant effect. We aimed to identify topological parameters and their characteristics that contribute to the cancer driver gene's influential role. The results of this dissertation indicate that several topological parameters can be used to identify cancer “driver” genes. We found that these genes have higher values of topological parameters, such as Degree or Closeness, in tissues where they tend to cause cancer. We also found that this difference is present in oncogenes and tumor suppressor genes. 
Another factor that we found to influence the value of topological parameters is the number of tissues in which these genes cause the disease. Topological parameter values tend to increase with the number of tissues in which the genes cause cancer. Together, these results support a significant association between topological parameters such as the Degree and the influential role of a driver gene in cancer.
Using the notion of disease modules, network medicine has effectively identified disease-associated genes in recent years. In biological networks, genes linked to a given disease tend to interact closely [1]. These networks allow physical and functional connections between biomolecules to be identified, resulting in a map of the cellular components and processes that constitute biological systems [2]. Not all disease-associated genes, however, have a major impact on the disease phenotype. The discovery of important genes capable of producing or altering the disease phenotype paves the way for new therapies and a personalized medicine strategy. Recent research has found that the topological features of a biological network can predict perturbation effects in a dynamical model of the system with an accuracy of 65-80% [3, 4]. Biological networks differ depending on the tissue or cell type studied. As a result, each gene's topological features and its capacity to impact the system may change [5]. The main goal of this dissertation is to discover network topological parameters associated with cancer driver genes using tissue-specific networks. To achieve this, we evaluated the local network features around each driver gene in several tissue-specific networks, including tissues affected by the disease and others where perturbation of the gene has no significant effect. 
In this way, we can identify topological parameters and the characteristics that contribute to the influential role of cancer driver genes. To achieve our goals, we began by building and optimizing our tissue-specific networks. Each tissue-specific network was built using four different databases of protein-protein interactions, signaling pathways and transcription factors. We tried four different methods of building the networks, including filtering on gene expression levels above 0.1 and 5 transcripts per million in each tissue. We also built a matrix associating the cancer driver genes (taken from an online database of cancer driver genes) with the tissues where they cause the disease. Each driver gene was placed in one of six categories according to the number of tissues where it causes cancer, with category six comprising the genes that cause the disease in six or more tissues. We began by comparing the values of the genes' topological parameters in tissues where they cause the disease versus their values in tissues where they do not. These values were also compared with a list of cancer-associated genes (taken from an online database of disease-associated genes) that are not cancer drivers, and with a list of genes not associated with any disease. This study was carried out across the four different network construction methods. We continued the study by observing how the topological parameters differed at the tissue level. In each tissue, we analyzed the topological parameter values of the driver genes that cause the disease in that tissue versus the values of the genes that do not cause disease in that tissue. 
After comparing the topological parameter values using all driver genes together in one global group, we wanted to check whether the difference between their values in tissues where they cause cancer versus tissues where they do not was also present within the categories of the number of tissues where the driver genes cause cancer, and how these values increase or decrease across those categories. We then evaluated the combined impact of the topological parameter values (selecting the topological parameter “Degree”) of cancer driver genes in tissues where they cause disease versus where they do not, as well as the difference between them across the six categories of number of tissues where they cause cancer, using a Generalized Linear Model (GLM) to assess the interaction of these factors. From the database that supplied the list of cancer driver genes, we also took a list of oncogenes and tumor suppressor genes, which we used to evaluate the differences in their topological parameter values in tissues where they cause cancer versus tissues where they do not. To evaluate other variables that may have an impact beyond the topological parameters, and that may also differ depending on the number of tissues where the driver genes cause the disease, we used data from the same driver gene database, which included information on the number of interactions each driver gene establishes with different miRNAs and on the number of protein complexes these genes belong to. We also evaluated the impact of gene expression across the different categories of number of tissues. Finally, we performed functional enrichment of the cancer driver genes using two different methods. In the first method we used the genes that had the largest topological difference (for this study we used only the topological parameter “Degree”) between the tissues where they do or do not cause cancer. 
We classified each gene as positive, negative or not significant based on the difference between the mean “Degree” value in tissues where it causes cancer versus the value in tissues where it does not. The second method was the enrichment of the different cancer driver genes according to the number of tissues in which they cause cancer. We carried out this study using the different categories of number of tissues. Overall, our results suggest that the values of the topological parameters (for example, “Degree” and “Closeness”) tend to be highest in the tissues in which cancer driver genes cause the disease (“TissueDrivers”), followed by the values of cancer genes that are not cancer drivers but are associated with development of the disease (“Disease Genes”), the values of cancer driver genes in tissues where they do not cause cancer (“NonTissueDrivers”) and lastly, with the lowest topological parameter values, the genes not associated with any disease. The difference between topological parameter values in “TissueDrivers” versus “NonTissueDrivers” is statistically significant for most of the topological parameters tested and across the different network methods used, except for the “JustHuRiTPM5Zminmax” method (which uses only the HuRI database). When we analyzed the topological parameter values in each tissue, we could see that “Degree” values tend to be higher in cancer driver genes that cause cancer in that tissue compared with driver genes that do not cause cancer in that tissue. This difference is statistically significant in many of the tissues analyzed. 
Regarding how the topological parameter values behave across the different categories associated with the number of tissues in which the driver genes cause cancer, we found that for cancer driver genes that cause disease in only one or two tissues, the “Degree” value in the tissues where they cause cancer is lower than the value in the tissues where they do not. We observed the opposite trend in driver genes that cause cancer in six or more tissues (the “Degree” value is higher in the tissues where they cause cancer). We also observed that the “Degree” value increases gradually across the tissue-number categories, reaching its highest value in category six (comprising driver genes that cause cancer in six or more tissues). In the generalized linear model (GLM), we could see the combined effect of the tissue-type variable (whether or not the driver gene causes cancer in the tissue, showing a statistically significant difference between these two situations) and the variable for the number of tissues where the driver genes cause cancer (also showing a statistically significant difference between the categories). The interaction between these two factors was also statistically significant. We could also observe statistically different “Degree” values between tumor suppressor driver genes in the tissues where they cause cancer (with higher values) and their values in the tissues where they do not. We saw the same difference in oncogenes, but with lower significance. “Degree” values in tumor suppressor genes were lower than those shown by oncogenes. We could likewise see a clear trend of correlation between an increasing number of tissues and an increasing number of complexes that the cancer driver genes belong to. The same behavior was observed for the number of miRNAs with which the driver genes interact. 
Regarding mRNA expression across the tissue-number categories, we could see a statistically significant difference in categories two and three between the values of the driver genes (with respect to the topological parameter “Degree”) in the tissues where they cause cancer versus where they do not. Finally, in the functional enrichment study we could see that the biological processes, molecular functions and cellular components that we obtained as enriched using the method based on the different tissue-number categories are much more closely related to the cancer processes described in the literature (“hallmarks of cancer”). We could not find a very clear division between the enriched biological functions of genes with a “Degree” z-score difference above 1 and those with a difference below -1. We found no relevant functional enrichment process in either of these two groups of genes that could in some way distinguish them from each other. The results of this dissertation suggest that several topological parameters may be associated with cancer driver genes. We found that these genes have higher values of topological parameters, such as Degree or Closeness, in the tissues where they tend to cause cancer. We also found that this difference is present in oncogenes and tumor suppressor genes. Another factor that we found to influence the value of the topological parameters is the number of tissues in which these genes cause the disease. There is an increasing trend in the topological value with the number of tissues in which they cause cancer.
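The two topological parameters named in this abstract, Degree and Closeness, can be illustrated with a small sketch. The two toy "tissue" networks and the node names below are hypothetical and carry no biological meaning; the closeness definition used (reachable nodes divided by the sum of shortest-path lengths) is one common convention and is assumed here, since the abstract does not state which variant was used.

```python
from collections import deque


def build_adjacency(edges):
    """Turn an undirected edge list into a node -> set-of-neighbours map."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def degree(adjacency, node):
    """Number of direct interaction partners of node."""
    return len(adjacency.get(node, ()))


def closeness(adjacency, node):
    """Reachable nodes divided by the sum of shortest-path lengths to them."""
    distances = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives shortest paths in unweighted graphs
        current = queue.popleft()
        for neighbour in adjacency.get(current, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    total = sum(distances.values())
    return (len(distances) - 1) / total if total else 0.0


# Hypothetical networks: the same four genes wired differently in two tissues.
affected_tissue = build_adjacency(
    [("driver", "g1"), ("driver", "g2"), ("driver", "g3"), ("g1", "g2")]
)
unaffected_tissue = build_adjacency(
    [("driver", "g1"), ("g1", "g2"), ("g2", "g3")]
)

degree_affected = degree(affected_tissue, "driver")
degree_unaffected = degree(unaffected_tissue, "driver")
closeness_affected = closeness(affected_tissue, "driver")
closeness_unaffected = closeness(unaffected_tissue, "driver")
print(degree_affected, degree_unaffected, closeness_affected, closeness_unaffected)
```

In this toy case the driver node is a hub in one network and a peripheral node in the other, so both parameters are higher in the first network, which mirrors the direction of the comparison described in the abstract.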

    Bioinformatics applied to human genomics and proteomics: development of algorithms and methods for the discovery of molecular signatures derived from omic data and for the construction of co-expression and interaction networks

    The present PhD dissertation develops and applies bioinformatic methods and tools to address key current problems in the analysis of human omic data. The work is organised by main objective into four chapters focused on: (i) development of an algorithm for the analysis of changes and heterogeneity in large-scale omic data; (ii) development of a method for non-parametric feature selection; (iii) integration and analysis of human protein-protein interaction networks; and (iv) integration and analysis of human co-expression networks derived from tissue expression data and evolutionary profiles of proteins. In the first chapter, we developed and tested a new robust algorithm in R, called DECO, for the discovery of subgroups of features and samples within large-scale omic datasets. DECO explores all possible feature differences and heterogeneity by integrating both data dispersion and predictor-response information in a new statistical parameter called h (heterogeneity score). In the second chapter, we present a simple non-parametric statistic to measure the cohesiveness of categorical variables along any quantitative variable, applicable to feature selection in all types of big data sets. In the third chapter, we describe an analysis of the human interactome integrating two global datasets from high-quality proteomics technologies: HuRI (a human protein-protein interaction network generated by systematic experimental screening based on yeast two-hybrid technology) and Cell-Atlas (a comprehensive map of the subcellular localization of human proteins generated by antibody imaging). This analysis aims to create a framework for characterizing subcellular localization supported by the human protein-protein interactome. 
In the fourth chapter, we developed a full integration of three high-quality proteome-wide resources (Human Protein Atlas, OMA and TimeTree) to generate a robust human co-expression network across tissues, assigning each human protein a position along the evolutionary timeline. In this way, we investigate how old in evolution and how correlated the different human proteins are, and we place them all in a common interaction network. Overall, the work presented in this PhD uses and develops a wide variety of bioinformatic and statistical tools for the analysis, integration and elucidation of molecular signatures and biological networks using human omic data. Most of these data correspond to sample cohorts generated in recent biomedical studies on specific human diseases.
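A co-expression network across tissues, as described in the fourth chapter, can be sketched by correlating expression profiles and keeping strongly correlated pairs as edges. The sketch below is an illustration only: the use of Pearson correlation, the 0.9 cut-off and the four made-up protein profiles are assumptions, and the dissertation's actual procedure is not described in the abstract.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def coexpression_edges(expression, threshold=0.9):
    """Return (protein_1, protein_2, r) for every pair with r >= threshold."""
    edges = []
    for p1, p2 in combinations(sorted(expression), 2):
        r = pearson(expression[p1], expression[p2])
        if r >= threshold:
            edges.append((p1, p2, r))
    return edges


# Hypothetical expression levels of four proteins across four tissues.
expression = {
    "P1": [1, 2, 3, 4],
    "P2": [2, 4, 6, 8],  # rises with P1
    "P3": [4, 3, 2, 1],  # falls as P1 rises
    "P4": [1, 3, 2, 4],  # only loosely follows P1
}

edges = coexpression_edges(expression)
print(edges)
```

Only the P1 and P2 pair passes the cut-off here. Each retained node could then be annotated with an evolutionary age, for example from a lookup table, to place proteins along a timeline as the abstract describes.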

    Exploration of large molecular datasets using global gene networks : computational methods and tools

    Defining gene expression profiles and mapping complex interactions between molecular regulators and proteins is key to understanding biological processes and the functional properties of cells, and is therefore the focus of numerous experimental studies. Small-scale biochemical analyses deliver high-quality data but lack coverage, whereas high-throughput sequencing reveals thousands of interactions, which can be error-prone and require proper computational methods to discover true relations. Furthermore, all these approaches usually focus on one type of interaction at a time. This makes experimental mapping of the genome-wide network a costly and time-intensive procedure. In the first part of the thesis, I present network analysis tools developed for exploring large-scale datasets in the context of a global network of functional coupling. Paper I introduces NEArender, a method for performing pathway analysis and determining the relations between gene sets using a global network. Traditionally, pathway analysis did not consider network relations, thereby covering only a minor part of the whole picture. Placing the gene sets in the context of a network provides additional information for pathway analysis, which reveals a more comprehensive picture. Paper II presents EviNet, a user-friendly web interface for using the NEArender algorithm. The user can either input gene lists or manage and integrate highly complex experimental designs via the interactive Venn diagram-based interface. The web resource provides access to biological networks and pathways from multiple public sources or the user's own resources. The analysis typically takes seconds or minutes, and the results are presented in graphic and tabular formats. Paper III describes NEAmarker, a method to predict anti-cancer drug targets from enrichment scores calculated by NEArender, thus presenting a practical use of the network enrichment tool. 
The method can integrate data from multiple omics platforms to model drug sensitivity with enrichment variables. In parallel, alternative methods for pathway enrichment analysis were benchmarked in the paper. The second part of the thesis focuses on identifying spatial and temporal mechanisms that govern the formation of neural cell diversity in the developing brain. High-throughput platforms for RNA- and ChIP-sequencing were applied to provide data for studying the underlying biological hypothesis at the genome-wide scale. In Paper IV, I defined the role of the transcription factor Foxa2 during the specification and differentiation of floor plate cells of the ventral neural tube. RNA-seq analyses of Foxa2-/- cells identified a large set of candidate genes involved in floor plate differentiation. Analysis of the Foxa2 ChIP-seq dataset suggested that Foxa2 directly regulates more than 250 genes expressed by the floor plate and identified Rfx4 and Ascl1 as co-regulators of many floor plate genes. Experimental studies suggested a cooperative activator function for Foxa2 and Rfx4 and a suppressive role for Ascl1 in spatially constraining floor plate induction. Paper V addresses how time is measured during the sequential specification of neurons from multipotent progenitor cells during the development of the ventral hindbrain. An underlying timer circuitry that leads to the sequential generation of motor neurons and serotonergic neurons was identified by integrating experimental data and computational modeling.
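The network-based pathway analysis described in the first part of this abstract places gene sets in the context of a global network. One simplified way to express that idea is to count the network links that connect two gene sets and compare the count with what node degrees alone would predict. The sketch below is a generic illustration under that assumption and is not NEArender's actual statistic or API; the function names, the toy network and the gene sets are all hypothetical.

```python
def node_degrees(edges):
    """Count how many edges touch each node of an undirected network."""
    degrees = {}
    for u, v in edges:
        degrees[u] = degrees.get(u, 0) + 1
        degrees[v] = degrees.get(v, 0) + 1
    return degrees


def observed_links(edges, set_a, set_b):
    """Number of edges with one endpoint in set_a and the other in set_b."""
    return sum(
        1
        for u, v in edges
        if (u in set_a and v in set_b) or (u in set_b and v in set_a)
    )


def expected_links(edges, set_a, set_b):
    """Degree-based expectation: deg(set_a) * deg(set_b) / (2 * total edges)."""
    degrees = node_degrees(edges)
    degree_a = sum(degrees.get(g, 0) for g in set_a)
    degree_b = sum(degrees.get(g, 0) for g in set_b)
    return degree_a * degree_b / (2 * len(edges))


# Hypothetical global network and gene sets.
network = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
experimental_set = {"a", "b"}  # e.g. genes altered in an experiment
pathway_set = {"c"}            # e.g. members of a known pathway

n_observed = observed_links(network, experimental_set, pathway_set)
n_expected = expected_links(network, experimental_set, pathway_set)
print(n_observed, n_expected)
```

An observed count well above the expected one suggests that the two sets are more tightly connected than their degrees alone explain. A full analysis would attach a significance test to this comparison, which is omitted here.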