423 research outputs found

    TLGP: a flexible transfer learning algorithm for gene prioritization based on heterogeneous source domain

    Background: Gene prioritization (gene ranking) aims to obtain the centrality of genes, which is critical for cancer diagnosis and therapy since key genes correspond to biomarkers or drug targets. Great efforts have been devoted to the gene ranking problem by exploring the similarity between candidate and known disease-causing genes. However, when the number of known disease-causing genes is limited, these approaches are largely inapplicable because of their low accuracy. In practice, the number of known disease-causing genes for cancers, particularly rare cancers, is very limited. Therefore, there is a critical need for effective and efficient gene ranking algorithms that work with limited prior disease-causing genes. Results: In this study, we propose a transfer learning based algorithm for gene prioritization (called TLGP) in a cancer without known disease-causing genes (the target domain) by transferring knowledge from other cancers (the source domain). The underlying assumption is that knowledge shared by similar cancers improves the accuracy of gene prioritization. Specifically, TLGP first quantifies the similarity between the target and source domains by calculating the affinity matrix for genes. Then, TLGP automatically learns a fusion network for the target cancer by fusing the affinity matrix, pathogenic genes and genomic data of the source cancers. Finally, genes in the target cancer are prioritized. The experimental results indicate that the learnt fusion network is more reliable than the gene co-expression network, implying that transferring knowledge from other cancers improves the accuracy of network construction. Moreover, TLGP outperforms state-of-the-art approaches in accuracy, with an improvement of at least 5%. Conclusion: The proposed model and method provide an effective and efficient strategy for gene ranking by integrating genomic data from various cancers.

    Network-based analysis of eQTL data to prioritize driver mutations

    In clonal systems, interpreting driver genes in terms of molecular networks helps in understanding how these drivers elicit an adaptive phenotype. Obtaining such a network-based understanding depends on the correct identification of driver genes. In clonal systems, independently evolved lines can acquire a similar adaptive phenotype by affecting the same molecular pathways, a phenomenon referred to as parallelism at the molecular pathway level. This implies that successful driver identification depends on interpreting mutated genes in terms of molecular networks. Driver identification and obtaining a network-based understanding of the adaptive phenotype are thus confounded problems that ideally should be solved simultaneously. In this study, a network-based eQTL method is presented that solves both the driver identification and the network-based interpretation problem. As input, the method uses coupled genotype-expression phenotype data (eQTL data) of independently evolved lines with similar adaptive phenotypes and an organism-specific genome-wide interaction network. The search for mutational consistency at the pathway level is defined as a subnetwork inference problem, which consists of inferring a subnetwork from the genome-wide interaction network that best connects the genes containing mutations to differentially expressed genes. Based on their connectivity with the differentially expressed genes, mutated genes are prioritized as driver genes. Using semisynthetic data and two publicly available data sets, we illustrate the potential of the network-based eQTL method to prioritize driver genes and to gain insight into the molecular mechanisms underlying an adaptive phenotype. The method is available at http://bioinformatics.intec.ugent.be/phenetic_eqtl/index.htm

    ProSim: A Method for Prioritizing Disease Genes Based on Protein Proximity and Disease Similarity


    Discovery of tissue specific network properties associated with cancer driver genes

    Master's thesis in Biochemistry, Faculdade de Ciências, Universidade de Lisboa, 2022. Using the notion of disease modules, network medicine has effectively identified disease-associated genes in recent years. In biological networks, genes linked to a particular illness tend to interact closely [1]. These networks allow both physical and functional connections between biomolecules to be identified, resulting in a map of the cell components and processes that constitute biological systems [2]. Not all disease-associated genes, however, have a major impact on disease phenotype. The discovery of important genes able to produce or change disease phenotype paves the way for new therapies and a personalized medicine strategy. Recent research has found that the topological features of biological networks alone can predict perturbation effects in a dynamical model of the system with 65-80% accuracy [3, 4]. Biological networks differ depending on the tissue or cell type being studied. As a result, each gene's topological features and ability to affect the system may change [5]. The main goal of this thesis is to discover network topological parameters associated with influential cancer driver genes using context-specific networks. To achieve this, we evaluated local network features around each driver gene across multiple tissue-specific networks, including tissues that are affected in the disease and others where perturbing the gene has no significant effect. We aimed to identify the topological parameters, and their characteristics, that contribute to a cancer driver gene's influential role. The results of this dissertation indicate that several topological parameters can be used to identify cancer “driver” genes. We found that these genes have higher values of topological parameters, such as Degree or Closeness, in tissues where they tend to cause cancer. We also found that this difference is present in both oncogenes and tumor suppressor genes.
Another factor that we found to influence the value of topological parameters is the number of tissues in which these genes cause the disease. Topological parameter values tend to increase with the number of tissues in which the genes cause cancer. Together, these results support a significant association between topological parameters such as Degree and the influential role of a driver gene in cancer.
Using the notion of disease modules, network medicine has effectively identified disease-associated genes in recent years. In biological networks, genes linked to a given disease tend to interact closely [1]. These networks allow physical and functional connections between biomolecules to be identified, resulting in a map of the cellular components and processes that constitute biological systems [2]. Not all disease-associated genes, however, have a major impact on the disease phenotype. The discovery of important genes able to produce or alter the disease phenotype paves the way for new therapies and a personalized medicine strategy. Recent research has found that the topological features of a biological network can predict perturbation effects in a dynamical model of the system with 65-80% accuracy [3, 4]. Biological networks differ depending on the tissue or cell type studied. As a result, each gene's topological features and its ability to affect the system may change [5]. The main goal of this dissertation is to discover network topological parameters associated with cancer driver genes using tissue-specific networks. To achieve this, we evaluated local network features around each driver gene across multiple tissue-specific networks, including tissues affected by the disease and others where perturbing the gene has no significant effect.
In this way, we can identify the topological parameters and characteristics that contribute to the influential role of cancer driver genes. To achieve our goals, we began by building and optimizing our tissue-specific networks. Each tissue-specific network was built using four different databases of protein-protein interactions, signaling pathways and transcription factors. We tried four different methods of building the networks, including filtering on gene expression levels above 0.1 and 5 transcripts per million in each tissue. We also built a matrix associating cancer driver genes (taken from an online database of cancer driver genes) with the tissues in which they cause the disease. Each driver gene was placed into one of six categories according to the number of tissues in which it causes cancer, with category six comprising the genes that cause the disease in six or more tissues. We began by comparing the values of the genes' topological parameters in tissues where they cause the disease with their values in tissues where they do not. These values were also compared with a list of cancer-associated genes that are not cancer drivers (taken from an online database of disease-associated genes) and a list of genes not associated with any disease. This study was carried out for all four network construction methods. We continued by examining how the topological parameters differed at the tissue level. Within each tissue, we compared the topological parameter values of driver genes that cause the disease in that tissue with those of driver genes that do not cause disease in that tissue.
After comparing the topological parameter values using all driver genes together as one global group, we wanted to check whether the difference between their values in tissues where they cause cancer and in tissues where they do not was also present within the categories of number of tissues in which the driver genes cause cancer, and how these values increase or decrease across those categories. We then evaluated the combined impact of the topological parameter values (selecting the topological parameter “Degree”) of cancer driver genes in tissues where they cause disease versus where they do not, and also the difference between these across the six categories of number of tissues in which they cause cancer, using a Generalized Linear Model (GLM) to assess the interaction of these factors. From the database that supplied the list of cancer driver genes, we also took a list of oncogenes and tumor suppressor genes, which we used to evaluate differences in their topological parameter values between tissues where they cause cancer and tissues where they do not. To assess other variables, beyond the topological parameters, that might have an impact and might also differ depending on the number of tissues in which the driver genes cause the disease, we used data from the same driver gene database, which included the number of interactions each driver gene establishes with different miRNAs and the number of protein complexes in which these genes participate. We also evaluated the impact of gene expression across the categories of number of tissues. Finally, we performed functional enrichment of the cancer driver genes using two different methods. In the first method, we used the genes with the largest topological difference (for this study we used only the topological parameter “Degree”) between the tissues where they do and do not cause cancer.
We classified each gene as positive, negative or non-significant based on the difference between the mean “Degree” value in tissues where it causes cancer and the value in tissues where it does not. The second method was enrichment of the cancer driver genes grouped by the number of tissues in which they cause cancer. We carried out this study using the different categories of number of tissues. Overall, our results suggest that topological parameter values (for example, “Degree” and “Closeness”) tend to be highest for cancer driver genes in the tissues where they cause the disease (“TissueDrivers”), followed by cancer genes that are not drivers but are associated with development of the disease (“Disease Genes”), then cancer driver genes in tissues where they do not cause cancer (“NonTissueDrivers”) and finally, with the lowest topological parameter values, genes not associated with any disease. The difference between topological parameter values in “TissueDrivers” versus “NonTissueDrivers” is statistically significant for most of the topological parameters tested and across the different network methods used, except for the “JustHuRiTPM5Zminmax” method (which uses only the Huri database). When we analyzed topological parameter values within each tissue, we saw that “Degree” values tend to be higher for cancer driver genes that cause cancer in that tissue than for driver genes that do not cause cancer in that tissue. This difference is statistically significant in many of the tissues analyzed.
Regarding how topological parameter values behave across the categories of number of tissues in which driver genes cause cancer, we found that for driver genes that cause disease in only one or two tissues, the “Degree” value in tissues where they cause cancer is lower than the value in tissues where they do not. We observed the opposite trend for driver genes that cause cancer in six or more tissues (the “Degree” value is higher in tissues where they cause cancer). We also observed that the “Degree” value increases gradually across the tissue-number categories, reaching its highest value in category six (driver genes that cause cancer in six or more tissues). In the generalized linear model (GLM), we could see the combined effect of the tissue-type variable (whether or not the driver gene causes cancer in the tissue, with a statistically significant difference between the two situations) and of the variable for the number of tissues in which driver genes cause cancer (also with a statistically significant difference among the categories). The interaction between these two factors was also statistically significant. We also observed statistically different “Degree” values for tumor suppressor driver genes between tissues where they cause cancer (higher values) and tissues where they do not. We saw the same difference for oncogenes, but with lower significance. “Degree” values for tumor suppressor genes were lower than those for oncogenes. We also saw a clear trend of correlation between an increasing number of tissues and an increasing number of complexes in which the cancer driver genes participate. The same behavior was observed for the number of miRNAs with which the driver genes interact.
Regarding mRNA expression across the categories of number of tissues, we saw a statistically significant difference in categories two and three between the values for driver genes (with respect to the topological parameter “Degree”) in tissues where they cause cancer versus where they do not. Finally, in the functional enrichment study, we saw that the biological processes, molecular functions and cellular components found enriched using the method based on the categories of number of tissues are much more closely related to the cancer processes described in the literature (“hallmarks of cancer”). We could not find a very clear division between the enriched biological functions of genes with a “Degree” z-score difference above 1 and those with a difference below -1. We found no relevant functional enrichment in either of these two groups of genes that could distinguish them from each other. The results of this dissertation indicate that several topological parameters may be associated with cancer driver genes. We found that these genes have higher values of topological parameters, such as Degree or Closeness, in the tissues where they tend to cause cancer. We also found that this difference is present in oncogenes and tumor suppressor genes. Another factor that we found to influence the value of topological parameters is the number of tissues in which these genes cause the disease. The topological value tends to increase with the number of tissues in which they cause cancer.
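
As a rough illustration of the two topological parameters named in this abstract, the sketch below computes Degree and Closeness for one gene in two tissue-specific networks. The gene names (`g1` to `g4`), the tissue names and both toy networks are hypothetical, and this is not the thesis's pipeline; closeness is taken here as (reachable nodes minus one) divided by the sum of shortest-path distances.

```python
from collections import deque

def degree(graph, node):
    """Number of direct interaction partners of `node`."""
    return len(graph[node])

def closeness(graph, node):
    """(reachable nodes - 1) / sum of shortest-path distances; 0.0 if isolated."""
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search over unweighted edges
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Hypothetical tissue-specific networks (adjacency sets), for illustration only.
tissue_networks = {
    "tissue_A": {"g1": {"g2", "g3", "g4"}, "g2": {"g1"}, "g3": {"g1"}, "g4": {"g1"}},
    "tissue_B": {"g1": {"g2"}, "g2": {"g1", "g3"}, "g3": {"g2", "g4"}, "g4": {"g3"}},
}

for tissue, graph in tissue_networks.items():
    print(tissue, degree(graph, "g1"), closeness(graph, "g1"))
```

In this toy setting `g1` is a hub in `tissue_A` and a peripheral node in `tissue_B`, which is the kind of per-tissue contrast the abstract describes comparing.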

    Incorporation of Knowledge for Network-based Candidate Gene Prioritization

    In order to identify the genes associated with a given disease, a number of different high-throughput techniques are available such as gene expression profiles. However, these high-throughput approaches often result in hundreds of different candidate genes, and it is thus very difficult for biomedical researchers to narrow their focus to a few candidate genes when studying a given disease. In order to assist in this challenge, a process called gene prioritization can be utilized. Gene prioritization is the process of identifying and ranking new genes as being associated with a given disease. Candidate genes which rank high are deemed more likely to be associated with the disease than those that rank low. This dissertation focuses on a specific kind of gene prioritization method called network-based gene prioritization. Network-based methods utilize a biological network such as a protein-protein interaction network to rank the candidate genes. In a biological network, a node represents a protein (or gene), and a link represents a biological relationship between two proteins such as a physical interaction. The purpose of this dissertation was to investigate if the incorporation of biological knowledge into the network-based gene prioritization process can provide a significant benefit. The biological knowledge consisted of a variety of information about a given gene including gene ontology (GO) functional terms, MEDLINE articles, gene co-expression measurements, and protein domains to name just a few. The biological knowledge was incorporated into the network’s links and nodes as link and node knowledge respectively. An example of link knowledge is the degree of functional similarity between two proteins, and an example of node knowledge is the number of GO terms associated with a given protein. 
Since there were no existing network-based inference algorithms which could incorporate node knowledge, I developed a new network-based inference algorithm to incorporate both link and node knowledge called the Knowledge Network Gene Prioritization (KNGP) algorithm. The results showed that the incorporation of biological knowledge via link and node knowledge can provide a significant benefit for network-based gene prioritization. The KNGP algorithm was utilized to combine the link and node knowledge
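
The abstract does not give KNGP's update rule, so the following is only a generic sketch of how link and node knowledge could enter a network-based ranking: a random walk with restart in which edge weights stand in for link knowledge and the restart distribution stands in for node knowledge. The function name `prioritize`, the damping value and the toy genes `A` to `D` are all assumptions made for illustration.

```python
def prioritize(edges, node_prior, damping=0.85, iterations=100):
    """Rank nodes by a random walk with restart on a weighted, undirected graph.

    edges: dict mapping (u, v) -> weight (illustrative "link knowledge")
    node_prior: dict mapping node -> non-negative prior (illustrative "node knowledge")
    """
    nodes = sorted(node_prior)
    neighbors = {n: {} for n in nodes}
    for (u, v), weight in edges.items():
        neighbors[u][v] = weight
        neighbors[v][u] = weight

    total_prior = sum(node_prior.values())
    prior = {n: node_prior[n] / total_prior for n in nodes}
    score = dict(prior)

    for _ in range(iterations):
        # Restart step: return to the prior with probability 1 - damping.
        new_score = {n: (1 - damping) * prior[n] for n in nodes}
        for u in nodes:
            out_weight = sum(neighbors[u].values())
            if out_weight == 0:
                # Node without links: hand its mass back according to the prior.
                for n in nodes:
                    new_score[n] += damping * score[u] * prior[n]
                continue
            for v, weight in neighbors[u].items():
                new_score[v] += damping * score[u] * weight / out_weight
        score = new_score

    ranking = sorted(nodes, key=lambda n: score[n], reverse=True)
    return ranking, score

# Toy example: A is a known disease gene; B is strongly linked to A, C only weakly.
edges = {("A", "B"): 1.0, ("A", "C"): 0.1, ("C", "D"): 1.0}
node_prior = {"A": 1.0, "B": 0.01, "C": 0.01, "D": 0.01}
ranking, scores = prioritize(edges, node_prior)
print(ranking)
```

Candidates that are strongly connected to a high-prior gene end up ranked above weakly connected ones, which is the intuition behind ranking candidates on a knowledge-weighted network.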

    Context matters: the power of single-cell analyses in identifying context-dependent effects on gene expression in blood immune cells

    The human immune system is a complex system that we still do not fully understand. No two humans react in the same way to attacks by bacteria, viruses or fungi. Factors such as genetics, the type of pathogen or previous exposure to the pathogen may explain this diversity in response. Single-cell RNA sequencing (scRNA-seq) is a new technique that enables us to study the gene expression of each cell individually, allowing us to study immune diversity in much greater detail. This increased resolution helps us discern how disease-associated genetic variants actually contribute to disease. In this thesis, I studied the relation between disease-associated genetic variants and gene expression levels in the context of different cell types and pathogen exposures in order to gain insight into the working mechanisms of these variants. For many variants we learnt in which cell types and under which pathogen exposures they affect gene expression, and we were even able to identify changes in gene co-expression, suggesting that disease-associated variants change how our genes interact with each other. With the single-cell field being so new, much of my work was showing the feasibility of using scRNA-seq to study the interplay between genetics and gene expression. To set up future research, we created guidelines for these analyses and established a consortium that brings together many major scientists in the field to enable large-scale studies across an even wider variety of contexts. This final work helps inform current and future large-scale scRNA-seq research

    Network-based identification of driver pathways in clonal systems

    Highly ethanol-tolerant bacteria for the production of biofuels, bacterial pathogens that are resistant to antibiotics, and cancer cells are examples of phenotypes that are of importance to society and are currently being studied. In order to better understand these phenotypes and their underlying genotype-phenotype relationships, it is now commonplace to investigate DNA and expression profiles using next generation sequencing (NGS) and microarray techniques. These techniques generate large amounts of omics data, which result in lists of genes that have mutations or expression profiles that potentially contribute to the phenotype. These lists often include a multitude of genes and are troublesome to verify manually, as performing literature studies and wet-lab experiments for a large number of genes is very time- and resource-consuming. Therefore, (computational) methods are required that can narrow these gene lists down by removing the generally abundant false positives and that can ideally provide additional information on the relationships between the selected genes. Other high-throughput techniques such as yeast two-hybrid (Y2H), ChIP-Seq and ChIP-chip, but also a myriad of small-scale experiments and predictive computational methods, have generated a treasure of interactomics data over the last decade, most of which is now publicly available. By combining this data into a biological interaction network, which contains all molecular pathways that an organism can utilize and thus is the equivalent of the blueprint of an organism, it is possible to integrate the omics data obtained from experiments with these biological interaction networks. Biological interaction networks are key to the computational methods presented in this thesis, as they enable methods to account for important relations between genes (and gene products).
In doing so, it is possible not only to identify interesting genes but also to uncover molecular processes important to the phenotype. As the best way to analyze omics data from an interesting phenotype varies widely based on the experimental setup and the available data, multiple methods were developed and applied in the context of this thesis. In a first approach, an existing method (PheNetic) was applied to a consortium of three bacterial species that together are able to efficiently degrade a herbicide, although none of the species is able to efficiently degrade the herbicide on its own. For each of the species, expression data (RNA-seq) was generated for the consortium and for the species in isolation. PheNetic identified molecular pathways that were differentially expressed and likely contribute to a cross-feeding mechanism between the species in the consortium. Having obtained proof-of-concept, PheNetic was adapted to cope with experimental evolution datasets in which, in addition to expression data, genomics data was also available. Two publicly available datasets were analyzed: Amikacin resistance in E. coli and coexisting ecotypes in E. coli. The results revealed both well-known and newly found molecular pathways involved in these phenotypes. Experimental evolution sometimes generates datasets consisting of mutator phenotypes, which have high mutation rates. These datasets are hard to analyze due to the large amount of noise (most mutations have no effect on the phenotype). To this end IAMBEE was developed. IAMBEE is able to analyze genomic datasets from evolution experiments even if they contain mutator phenotypes. IAMBEE was tested using an E. coli evolution experiment in which cells were exposed to increasing concentrations of ethanol. The results were validated in the wet-lab. In addition to methods for analysis of causal mutations and mechanisms in bacteria, a method for the identification of causal molecular pathways in cancer was developed.
As bacteria and cancerous cells are both clonal, they can be treated similarly in this context. The big differences are the amount of data available (many more samples are available in cancer) and the fact that cancer is a complex and heterogeneous phenotype. Therefore we developed SSA-ME, which makes use of the concept that a causal molecular pathway has at most one mutation in a cancerous cell (mutual exclusivity). However, enforcing this criterion is computationally hard. SSA-ME is designed to cope with this problem and to search for mutually exclusive patterns in relatively large datasets. SSA-ME was tested on cancer data from the TCGA PAN-cancer dataset. From the results we could, in addition to already known molecular pathways and mutated genes, predict the involvement of a few rarely mutated genes. (246 pages; status: published)
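
The mutual exclusivity criterion mentioned above can be made concrete with a small scoring sketch. This is not SSA-ME's search procedure, which the abstract does not specify; it only evaluates one candidate gene set. The helper name `exclusivity_stats`, the sample identifiers and the toy mutation calls are assumptions.

```python
def exclusivity_stats(mutations, gene_set):
    """Return (coverage, exclusivity) of a gene set over a cohort.

    mutations: dict mapping sample id -> set of mutated genes
    coverage: fraction of samples with at least one mutation in the set
    exclusivity: among covered samples, the fraction with exactly one such mutation
    """
    covered = 0
    exactly_one = 0
    for mutated_genes in mutations.values():
        hits = len(mutated_genes & gene_set)
        if hits >= 1:
            covered += 1
        if hits == 1:
            exactly_one += 1
    coverage = covered / len(mutations) if mutations else 0.0
    exclusivity = exactly_one / covered if covered else 0.0
    return coverage, exclusivity

# Toy cohort with made-up mutation calls, for illustration only.
cohort = {
    "s1": {"KRAS"},
    "s2": {"BRAF"},
    "s3": {"KRAS", "TP53"},
    "s4": set(),
}
print(exclusivity_stats(cohort, {"KRAS", "BRAF"}))
print(exclusivity_stats(cohort, {"KRAS", "TP53"}))
```

A gene set with high coverage and high exclusivity is the kind of pattern a mutual exclusivity search looks for; the computational difficulty noted in the abstract comes from the number of candidate gene sets, not from scoring any single one.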

    System-Level Analysis of Alzheimer's Disease Prioritizes Candidate Genes for Neurodegeneration.

    Alzheimer's disease (AD) is a debilitating neurodegenerative disorder. Since the advent of the genome-wide association study (GWAS) we have come to understand much about the genes involved in AD heritability and pathophysiology. Large case-control meta-GWAS studies have increased our ability to prioritize weaker effect alleles, while the recent development of network-based functional prediction has provided a mechanism by which we can use machine learning to reprioritize GWAS hits in the functional context of relevant brain tissues like the hippocampus and amygdala. In parallel with these developments, groups like the Alzheimer's Disease Neuroimaging Initiative (ADNI) have compiled rich compendia of AD patient data, including genotype and biomarker information as well as derived volume measures for relevant structures like the hippocampus and the amygdala. In this study we wanted to identify genes involved in AD-related atrophy of these two structures, which are often critically impaired over the course of the disease. To do this we developed a combined score prioritization method, which uses the cumulative distribution function of a gene's functional and positional scores to prioritize top genes that segregate not only with disease status but also with hippocampal and amygdalar atrophy. Our method identified a mix of genes that had previously been identified in AD GWAS, including APOE, TOMM40, and NECTIN2 (PVRL2), and several others that have not been identified in AD genetic studies but play integral roles in AD-affected functional pathways, including IQSEC1, PFN1, and PAK2. Our findings support the viability of our novel combined score as a method for prioritizing region- and even cell-specific AD risk genes
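
The abstract says the combined score uses the cumulative distribution function of a gene's functional and positional scores but does not state how the two are combined. The sketch below assumes an empirical CDF for each score and multiplies the two values; the product rule, the function names and the gene labels `g1` to `g3` are assumptions, not the authors' method.

```python
from bisect import bisect_right

def empirical_cdf(values):
    """Return F(x) = fraction of observed values less than or equal to x."""
    ordered = sorted(values)
    count = len(ordered)
    return lambda x: bisect_right(ordered, x) / count

def combined_scores(functional, positional):
    """Combine two per-gene scores via their empirical CDFs (product is an assumption)."""
    genes = sorted(functional.keys() & positional.keys())
    functional_cdf = empirical_cdf([functional[g] for g in genes])
    positional_cdf = empirical_cdf([positional[g] for g in genes])
    return {
        g: functional_cdf(functional[g]) * positional_cdf(positional[g])
        for g in genes
    }

# Made-up scores for three hypothetical genes.
functional = {"g1": 0.9, "g2": 0.5, "g3": 0.1}
positional = {"g1": 9.0, "g2": 8.0, "g3": 1.0}
scores = combined_scores(functional, positional)
top_gene = max(scores, key=scores.get)
print(top_gene, scores)
```

Mapping each score through its CDF puts both on a common 0 to 1 scale before they are combined, so neither score dominates simply because of its units.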

    To Transformers and Beyond: Large Language Models for the Genome

    In the rapidly evolving landscape of genomics, deep learning has emerged as a useful tool for tackling complex computational challenges. This review focuses on the transformative role of Large Language Models (LLMs), which are mostly based on the transformer architecture, in genomics. Building on the foundation of traditional convolutional neural networks and recurrent neural networks, we explore both the strengths and limitations of transformers and other LLMs for genomics. Additionally, we contemplate the future of genomic modeling beyond the transformer architecture based on current trends in research. The paper aims to serve as a guide for computational biologists and computer scientists interested in LLMs for genomic data. We hope the paper can also serve as an educational introduction and discussion for biologists to a fundamental shift in how we will be analyzing genomic data in the future

    Biological systems on a small scale
